At the moment it's a very manual process, and it might get automated
someday. Meanwhile it's still useful to have.
See #569 for an issue that got a test case added.
PostgreSQL btree indexes are limited in the size of the values they can
index: values must fit in an index page (8kB). So when porting a MySQL full
text index over full documents, we might run into an error like the
following:
index row size 2872 exceeds maximum 2712 for index "idx_5199509_search"
To fix this, query MySQL for the index type, which is FULLTEXT rather than
BTREE in those cases, and port it over to a PostgreSQL Full Text index with
a hard-coded 'simple' configuration, as in the following test case:
CREATE INDEX idx_75421_search ON mysql.fcm_batches USING gin(to_tsvector('simple', raw_payload));
Of course users might want to use a better configuration, including a proper
dictionary for the documents. In PostgreSQL each document may have its own
configuration attached and yet they can all get indexed into the same index,
so that's a task for the application developers, not for pgloader.
In passing, fix the list-typenames-without-btree-support.sql query to return
separate entries for each index type rather than an {array,representation}
of the result, as Postmodern won't turn the PostgreSQL array into a Common
Lisp array by default. I'm left wondering how it worked before.
Fix #569.
Adjust the default value formatting to check whether the default value is
already single-quoted, and only add new 'single quotes' when that's not the
case. Apparently ENUM default values in MariaDB 10 are now properly
single-quoted.
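The gist of the check, sketched with a hypothetical function name (not the
actual pgloader code):

;; Only wrap the default value in single quotes when MariaDB didn't
;; already send it quoted.
(defun ensure-single-quoted (default-value)
  (if (and (<= 2 (length default-value))
           (char= #\' (char default-value 0))
           (char= #\' (char default-value (1- (length default-value)))))
      default-value
      (format nil "'~a'" default-value)))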
When we already have the information in the pgloader internal catalogs,
don't issue another MySQL query. In this case, the extra query was used to
fetch the list of columns and their data types so that we can choose to send
either `colname` or astext(`colname`) as `colname` for some geographic types.
That's one less MySQL query per table.
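Something along these lines, with hypothetical names, is what the catalog
information enables without the extra round-trip:

;; Pick the MySQL SELECT expression for a column from the data type we
;; already have in the catalogs, wrapping geographic types in astext().
(defun column-select-expression (column-name data-type)
  (if (member data-type '("geometry" "point" "linestring" "polygon")
              :test #'string-equal)
      (format nil "astext(`~a`) as `~a`" column-name column-name)
      (format nil "`~a`" column-name)))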
The individual CAST decisions are visible in the CREATE TABLE statements
that are logged a moment later. Also, calling `format-create-sql' on a
column definition whose casting isn't finished yet processes default values
before they get normalized, and issues a WARNING to the poor user.
Not helpful. Bye.
In passing, add the identifiers case option to SQLite support, which makes
it easier to test here, and add a table named "TableName" to our local test
database.
Fix #631.
In the Travis environment we still test with PostgreSQL 9.1 and 9.6, and there's
no reason for this test to use a modern spelling of create schema, after
all.
It works because the test/csv-before-after.load creates the schema and is
run before test/csv-guess.load. That's good enough for now.
The previous patch made obvious some regression failures that had been
hidden by strange bugs with CCL.
One such regression was introduced in commit
ab7e77c2d0 where we played with the complex
code generation for field projection; the following two cases weren't
cleanly processed anymore:
column text using "constant"
column text using "field-name"
In the first case we want to load a user-defined constant into the column,
and in the second case we want to load the value of the field "field-name"
into the column: we just have different source and target names.
Another regression was introduced in the recent commit
01e5c23763 where the create-table function was
called too early, before we had fetched *pgsql-reserved-keywords*. As a
consequence, table names weren't always properly quoted, as shown in the
test/csv-header.load file, which targets a table named "group".
Finally, skip the test/dbf.load regression test when using CCL as this
environment doesn't have the necessary CP850 code page / encoding.
Due to errors in regression testing when using CCL, review this part of
pgloader. It turns out that cl-log:stop-messenger on a text-stream-messenger
closes the stream, which isn't a good idea when given *standard-output*.
At least it makes CCL choke when it then wants to output something of its
own, such as when running in --batch mode (which is nice because it outputs
more diagnostic information).
To solve that problem, initialize the text-stream-messenger with a broadcast
stream made from *standard-output*, which we may now close at will.
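The idea in a nutshell: closing a broadcast stream does not close its
constituent streams, so the messenger can close this one safely.

;; *standard-output* itself stays open after CLOSE is called here.
(let ((log-stream (make-broadcast-stream *standard-output*)))
  ;; ... hand LOG-STREAM to the text-stream-messenger ...
  (close log-stream))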
The parser files don't depend on the sources; it's the other way around
nowadays. Also, the responsibility to decipher the *sections* context should
be restricted to the monitor.lisp file, which is now the case.
And this time, fix #628 for real.
It seems that when compiling with CCL in “batch” mode, that is, using
buildapp, the local symbol exporting facility didn't work at all. It needs
to be run at load time so that the compiler sees the symbols.
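A minimal sketch of the fix, with a hypothetical symbol name rather than the
actual pgloader code: wrapping the export in eval-when makes it run at
compile and load time, so the compiler sees the exported symbol even in a
buildapp/CCL batch build.

(eval-when (:compile-toplevel :load-toplevel :execute)
  (export 'my-local-helper))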
Fix #628.
Google Cloud SQL instances now use the following format for the name of
their socket: <PROJECT-ID>:<REGION>:<INSTANCE_NAME>. We handle that by
allowing a colon in the socket directory name to be escaped by doubling it,
as in the username field. This also allows accepting any character in the
socket directory name, which is a good cleanup.
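The un-escaping side of the idea, sketched with hypothetical names (not the
actual pgloader parser, which works on the connection string grammar):

;; Once the socket directory has been parsed, turn the doubled colons
;; back into literal ones.
(defun unescape-doubled-colons (string)
  (with-output-to-string (out)
    (loop with skip = nil
          for (char next) on (coerce string 'list)
          do (cond (skip (setf skip nil))
                   ((and (char= char #\:) (eql next #\:))
                    (write-char #\: out)
                    (setf skip t))
                   (t (write-char char out))))))

;; (unescape-doubled-colons "my-project::europe-west1::pgsql")
;;   => "my-project:europe-west1:pgsql"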
Fix #621.
The implementation follows the PostgreSQL specifications as closely as
possible, with the escaping rules and the matching rules. The default paths
where to look for .pgpass (or pgpass.conf on Windows) are as documented in
PostgreSQL too. The only missing piece is the file permissions check.
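The escaping and matching rules amount to something like the following
sketch, with hypothetical names: fields are separated by unescaped colons,
a backslash escapes ':' and '\', and a '*' field matches any value.

(defun split-pgpass-line (line)
  "Split LINE on unescaped colons, removing the backslash escapes."
  (loop with fields = '() and field = '() and escaped = nil
        for char across line
        do (cond (escaped          (push char field) (setf escaped nil))
                 ((char= char #\\) (setf escaped t))
                 ((char= char #\:) (push (coerce (nreverse field) 'string) fields)
                                   (setf field '()))
                 (t                (push char field)))
        finally (push (coerce (nreverse field) 'string) fields)
                (return (nreverse fields))))

(defun pgpass-field-matches-p (field value)
  (or (string= field "*") (string= field value)))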
Fix #460.
Some code was pasted twice in src/api.lisp, and a defstruct with no slots
isn't spelled the way I did it in previous patches. We use a defstruct with
no slots to define a hierarchy on which to dispatch our pretty-print
function.
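The pattern in question, with hypothetical names: slot-less structures give
us a cheap type hierarchy to dispatch a generic function on.

(defstruct summary-format)
(defstruct (human-readable (:include summary-format)))
(defstruct (json-output    (:include summary-format)))

(defgeneric pretty-print-summary (format data)
  (:method ((format human-readable) data) (format t "~a~%" data))
  (:method ((format json-output)    data) (format t "~s~%" data)))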
It used to be that you would give the target table name as an option in the
PostgreSQL connection string, which is distasteful:
load ... into pgsql://user@host/dbname?tablename=foo.bar ...
Or even, for backwards compatibility:
load ... into pgsql://user@host/dbname?foo.bar ...
The new syntax makes provision for a separate clause for the target table
name, possibly schema-qualified:
load ... into pgsql://user@host/dbname target table foo.bar ...
Which is much better, in particular when used together with the target
columns clause.
Implementing this seemingly quite small feature had an impact on many
parsing related features of pgloader, such as the regression testing
facility. So much so that some extra refactoring found its way in here,
around the lisp-code-for-loading-from-<source> functions and their usage in
`load-data'.
While at it, this patch simplifies the `load-data' function a lot by making
good use of &allow-other-keys and :allow-other-keys t.
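A minimal sketch of that pattern, with hypothetical names: the entry point
accepts any keyword arguments and forwards them all, and each specific
loader picks out only the keys it knows about.

(defun run-load (&rest args &key type &allow-other-keys)
  (declare (ignore type))
  (apply #'load-from-csv :allow-other-keys t args))

(defun load-from-csv (&key source target (encoding :utf-8) &allow-other-keys)
  (format t "loading ~a into ~a (~a)~%" source target encoding))

;; (run-load :type :csv :source "/tmp/file.csv" :target "public.foo")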
Finally, this patch splits main.lisp into main.lisp and api.lisp, with the
latter intended to contain functions for Common Lisp programs wanting to use
pgloader as a library. The API itself is still the same as before this
patch, though. It just lives in another file for clarity.
In the previous commit we introduced support for database names including
spaces, which means that by default pgloader creates a target schema in
PostgreSQL with a space in its name. That works well as long as you always
double-quote the schema name, which pgloader does.
Now, in our internal catalogs, we keep the schema name double-quoted. And
when comparing that quoted schema name to the raw schema name from
PostgreSQL, they won't match, so pgloader tries to create the schema again:
ERROR Database error 42P06: schema "my sql" already exists
Fix the comparison to use the unquoted schema name; fix #614 again, since
the previous fix would only work the first time.
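The comparison boils down to something like this sketch, hypothetical name
included:

;; Strip the surrounding double quotes we keep in the catalogs before
;; comparing with the raw schema name reported by PostgreSQL.
(defun unquoted-name (name)
  (string-trim "\"" name))

;; (string= (unquoted-name "\"my sql\"") "my sql") => T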
This allows bypassing SSL when you don't need it, for instance over
localhost. It takes the same syntax as the PostgreSQL sslmode connection
string parameter.
In this commit we fail the guess faster, allowing us to test a much larger
sample. The sample size is still hard-coded, but this time to 1000 lines.
Also add a test case, see #618.
Understanding the timings requires not only the number of rows copied into
each table but also how many bytes that represents. We now add that
information to the output.
The number of bytes presented is computed from the unicode representation we
prepare in pgloader for each row before sending it down to PostgreSQL.
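One way to compute such a figure, as a sketch of the idea only (the babel
call is an assumption, not a quote of the pgloader code):

;; Count the octets of the encoded COPY text we are about to send for a row.
(defun row-bytes (copy-text)
  (length (babel:string-to-octets copy-text :encoding :utf-8)))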
Use a generic function protocol in order to implement the human readable,
verbose, csv, copy and json reporting output formats. This is much cleaner
and more extensible than the previous way.
Use that new power to implement a real JSON output from the internal state
object.
It seems that SBCL still needs some help in deciding when to GC with very
large values. In a test case with a “data” column averaging 375kB (up to
about 3 MB per datum), giving SBCL that help allows much larger batch size
and prefetch rows settings without entering lldb.
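The kind of help meant here, sketched with a hypothetical counter and
threshold rather than the exact pgloader change: trigger a full GC ourselves
once enough large data has gone through, instead of waiting for the runtime
heuristics.

(defparameter *bytes-since-last-gc* 0)   ; hypothetical bookkeeping

#+sbcl
(when (> *bytes-since-last-gc* (* 64 1024 1024))
  (sb-ext:gc :full t)
  (setf *bytes-since-last-gc* 0))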
The pgstate infrastructure already had lots of details about what's going
on. Add to it the information about how many bytes are sent in every batch,
and use this information in the monitor when something long is happening, to
display how many rows we have sent so far for this (supposedly) huge table,
along with bytes and speed (bytes per second).
Add a message every 20 batches so that the user knows it's still going on.
Also, in passing, fix some messages: the present tense is not precise enough
to tell whether the log refers to an event that has just been done or is
about to start.
The verbosity is not that easy to adjust. Remove useless messages and add a
new one telling when the COPY of a table is done. As we might have to wait
some time for indexes to be built, keep the CREATE INDEX lines. Also keep
the ALTER TABLE lines both for primary keys and foreign keys, again because
the user might have to wait for quite some time.
This feature has been asked for several times, and I can't see any way to fix
the GETENV parsing mess that we have. In this patch the GETENV support is
retired and replaced with a templating system, using the Mustache syntax.
To get back the GETENV feature, our implementation of the Mustache template
system adds support for fetching the template variable values from the OS
environment.
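The environment fallback can plug into the template rendering along these
lines, with hypothetical names: an explicit binding wins, otherwise we ask
the OS environment for the variable.

(defun template-variable-value (name bindings)
  (or (cdr (assoc name bindings :test #'string=))
      (uiop:getenv name)))

;; (template-variable-value "DBHOST" '(("DBHOST" . "db.local")))  => "db.local"
;; (template-variable-value "HOME" '())                           => value of $HOME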
Fixes #555, Fixes #609.
See #500, #477, #278.
It is sometimes necessary to tweak MS SQL server parameters, such as the
textsize parameter, which allows fetching the whole set of bytes of a text
or binary column (not kidding).
Now it's possible to add such a line in the load file:
set mssql parameters textsize to '104857600'
Fixes #603.
This code path is exercised from the command line only, which means I don't
get to run it that often. And it's a pain to debug. So make it easier to run
`process-source-and-target` from the REPL.
Startup log messages could be lost because the monitor would be started but
not ready to process messages. Fix that by “warming up” the monitoring
thread, having it execute a small computation and, more importantly, wait
for the result to be received back, blocking.
See #599 where parsing errors from a wrong URL were missed in the command
line output, quite disturbingly.
The previous patch fixed CREATE TYPE so that ENUM types are created in the
same schema as the table using them, but failed to update the DROP TYPE
statements to also target this schema...
Prior to this patch, pgloader didn't care which schema it created extra
types in. Extra types are mainly ENUM and SET support from MySQL. Now,
pgloader creates those extra PostgreSQL ENUM types in the same schema as the
table using them, which is a more sound default.
MySQL enums are cast to PostgreSQL enum types just fine, but sometimes
that's not what the user wants. When we have a CAST rule for an ENUM column,
recognize the fact and respect the user's choice.
Fixes #608.
The spelling in SQLite for the default value is "current_date"; instruct
pgloader about that. This commit also adds a test case to our sqlite.db
unit test database.
Fixes #607.
Now when building a bundle file for source distribution of pgloader, always
test it by building a binary image from the bundle tarball in a test
directory. Also make it easy to target "latest" Quicklisp distribution with
the following spelling:
make BUNDLEDIST=latest bundle
When distributing a pgloader bundle we're using the ql-dist facility. In a
recent commit we hand-picked the last known working distribution of
Quicklisp for pgloader. Make it easy to target the "latest" known
distribution, or to hard-code one from the Makefile or the bundle/ql.lisp
file.
We use a CSV parser for the MySQL enum values, but the quote escaping wasn't
properly set up: MySQL quotes ENUM values with a single-quote (') and uses
two of them ('') for escaping single-quotes when found in the ENUM value
itself.
Fixes #597.
When the table is empty we get nil for the min and max values of the id
column. In that case we don't compute a set of ranges and we “cancel”
concurrency support for the empty table.
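A minimal sketch of that behaviour, with hypothetical names: split
[min, max] into ranges for the concurrent readers, and return NIL (no
concurrency) when MIN/MAX came back as NULL because the table is empty.

(defun compute-ranges (min-id max-id range-size)
  (when (and min-id max-id)
    (loop for start from min-id to max-id by range-size
          collect (cons start (min (+ start range-size -1) max-id)))))

;; (compute-ranges 1 10 4)    => ((1 . 4) (5 . 8) (9 . 10))
;; (compute-ranges nil nil 4) => NIL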
Fixes #596.
The WITH option “include drop” used to also apply to schemas, which is not
that useful and is problematic when trying to DROP SCHEMA public, because
you might not be connecting as the owner of that schema.
Even if we don't target the public schema by default, users can choose to do
so thanks to our ALTER SCHEMA ... RENAME TO ... command.
Fixes #594.