The following casting rules are now the default for MySQL:
- type tinyint when unsigned to smallint drop typemod
- type smallint when unsigned to integer drop typemod
- type mediumint when unsigned to integer drop typemod
- type integer when unsigned to bigint drop typemod
Fixes #678.
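As a quick sanity check of the default rules above, the following Python sketch verifies that the largest unsigned value of each MySQL type fits in the signed PostgreSQL type it is cast to. The bit widths are the usual storage sizes of those types; the table and function names are made up for this illustration and are not pgloader code.

```python
# Storage size, in bits, of the signed PostgreSQL target types.
SIGNED_BITS = {"smallint": 16, "integer": 32, "bigint": 64}

# Unsigned MySQL source type -> (storage bits, PostgreSQL target type),
# following the four default cast rules listed above.
UNSIGNED_CASTS = {
    "tinyint": (8, "smallint"),
    "smallint": (16, "integer"),
    "mediumint": (24, "integer"),
    "integer": (32, "bigint"),
}


def fits(mysql_type: str) -> bool:
    """Return True when the largest unsigned value fits in the signed target."""
    bits, target = UNSIGNED_CASTS[mysql_type]
    unsigned_max = 2 ** bits - 1
    signed_max = 2 ** (SIGNED_BITS[target] - 1) - 1
    return unsigned_max <= signed_max


for name in UNSIGNED_CASTS:
    print(name, "unsigned fits:", fits(name))
```

Keeping the same type name would not work: an unsigned smallint goes up to 65535, which is more than a signed smallint can hold.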
MySQL allows using unsigned data types and pgloader should then target a
signed type of a larger capacity so that values can fit. For example, the
data definition “smallint(5) unsigned” should be cast to “integer”.
This patch allows user defined cast rules to be written against “unsigned”
data types as per their MySQL catalog representation.
See #678.
Google Cloud SQL instances now use the following format for the name of
their socket: <PROJECT-ID>:<REGION>:<INSTANCE_NAME>. To support it, a colon
in the socket directory name can be escaped by doubling it, as in the
username field. Any character is now accepted in the socket directory name,
which is a good cleanup.
Fix #621.
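The doubling rule can be sketched in a few lines of Python. This is only an illustration of the escaping idea, not pgloader's actual parser: the function name and the sample path are hypothetical.

```python
def split_on_colon(text: str) -> list[str]:
    """Split on single colons; a doubled colon stands for one literal colon."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text.startswith("::", i):
            current.append(":")      # escaped colon, kept as data
            i += 2
        elif text[i] == ":":
            parts.append("".join(current))   # real separator
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


# Hypothetical socket directory followed by a port number.
print(split_on_colon("/tmp/my-project::us-east1::db1:5432"))
```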
The implementation follows the PostgreSQL specification as closely as
possible, including the escaping rules and the matching rules. The default
location of the .pgpass file (or pgpass.conf on Windows) is as documented
in PostgreSQL too. Only the file permissions check is missing.
Fix #460.
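For readers unfamiliar with the password file, here is a minimal Python sketch of the rules as PostgreSQL documents them: each line is hostname:port:database:username:password, any of the first four fields may be the wildcard *, a backslash escapes a colon or a backslash, and the first matching line wins. The function names are made up for this example; this is not the Common Lisp implementation from the patch.

```python
def split_pgpass_line(line: str) -> list[str]:
    """Split one line on unescaped colons, resolving backslash escapes."""
    fields, current, chars = [], [], iter(line.rstrip("\n"))
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, "\\"))  # keep the escaped character
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def find_password(lines, host, port, dbname, user):
    """Return the password of the first matching line, or None."""
    wanted = (host, str(port), dbname, user)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or comment
        fields = split_pgpass_line(line)
        if len(fields) != 5:
            continue                      # malformed line, ignore it
        if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
            return fields[4]
    return None


sample = [
    "# hostname:port:database:username:password",
    "localhost:5432:*:bob:s3cret",
    "*:*:*:*:fallback",
]
print(find_password(sample, "localhost", 5432, "anydb", "bob"))
```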
It used to be that you would give the target table name as an option to the
PostgreSQL connection string, which is inelegant:
load ... into pgsql://user@host/dbname?tablename=foo.bar ...
Or even, for backwards compatibility:
load ... into pgsql://user@host/dbname?foo.bar ...
The new syntax makes provision for a separate clause for the target table
name, possibly schema-qualified:
load ... into pgsql://user@host/dbname target table foo.bar ...
Which is much better, in particular when used together with the target
columns clause.
Implementing this seemingly small feature had an impact on many
parsing-related features of pgloader, such as the regression testing
facility. So much so that some extra refactoring found its way in here,
around the lisp-code-for-loading-from-<source> functions and their usage in
`load-data'.
While at it, this patch greatly simplifies the `load-data' function by
making good use of &allow-other-keys and :allow-other-keys t.
Finally, this patch splits main.lisp into main.lisp and api.lisp, with the
latter intended to contain functions for Common Lisp programs wanting to use
pgloader as a library. The API itself is still the same as before this
patch, tho. Just in another file for clarity.
This allows bypassing SSL when you don't need it, for instance over
localhost. It takes the same syntax as the PostgreSQL sslmode connection
string parameter.
This feature has been asked for several times, and I can't see any way to
fix the GETENV parsing mess that we have. In this patch the GETENV support
is retired and replaced with a templating system, using the Mustache syntax.
To get back the GETENV feature, our implementation of the Mustache template
system adds support for fetching the template variable values from the OS
environment.
Fixes #555, Fixes #609.
See #500, #477, #278.
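The environment fallback can be illustrated with a small Python sketch. It handles only plain {{NAME}} variable tags, not the rest of Mustache (sections, partials, HTML escaping), and the render function is a hypothetical helper written for this example rather than pgloader's implementation.

```python
import os
import re

TAG = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(template: str, variables=None) -> str:
    """Replace {{ NAME }} tags, looking in variables first, then os.environ."""
    variables = variables or {}

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name in variables:
            return str(variables[name])
        return os.environ[name]  # KeyError when the name is defined nowhere

    return TAG.sub(lookup, template)


# DEMO_DBPATH is a made-up variable name used only for this example.
os.environ["DEMO_DBPATH"] = "/tmp/demo.db"
print(render("load database from '{{DEMO_DBPATH}}'"))
```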
The with option “include drop” used to also apply to schemas, which is not
that useful, and is problematic when trying to DROP SCHEMA public, because
you might not connect as the owner of that schema.
Even if we don't target the public schema by default, users can choose to do
so thanks to our ALTER SCHEMA ... RENAME TO ... command.
Fixes #594.
As we know how many columns we expect from the input file, it's possible to
read a sample (10 lines as of this patch) and try many different CSV reader
parameter combinations until we find one that works, meaning that it returns
the right number of fields.
It is still possible of course to specify parameters on the command line or
in a load file if necessary, but it makes the simple case even simpler. As
simple as:
pgloader file.csv pgsql:///pgloader?tablename=target
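The guessing strategy described above can be sketched in Python with the standard csv module. The candidate delimiters and quote characters below are assumptions chosen for the illustration; the list of combinations pgloader actually tries is not given here.

```python
import csv
import itertools


def guess_csv_params(sample_lines, expected_fields):
    """Try delimiter/quote combinations on the sample; return the first one
    that parses every line into expected_fields columns, or None."""
    delimiters = [",", ";", "\t", "|"]   # assumed candidates
    quotechars = ['"', "'"]              # assumed candidates
    for delimiter, quotechar in itertools.product(delimiters, quotechars):
        reader = csv.reader(sample_lines, delimiter=delimiter, quotechar=quotechar)
        rows = list(reader)
        if rows and all(len(row) == expected_fields for row in rows):
            return {"delimiter": delimiter, "quotechar": quotechar}
    return None


sample = ["1;'a;b';x", "2;'c';y"]
print(guess_csv_params(sample, expected_fields=3))
```

A combination is only accepted when every sampled line yields the expected column count, so a delimiter that happens to appear inside quoted data is rejected.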
This allows using a combination of "data only, drop indexes" so that when
the target database already exists, pgloader uses the existing schema, still
runs DROP INDEX before loading the data, and then does the CREATE INDEX
dance in parallel once all the data is loaded.
Also, as I could reproduce neither #539 (which is good, it's supposed to
be fixed now) nor #550 (that was open due to a regression): fixes #550.
Experiment with the idea of splitting the read work across several
concurrent threads, where each reader reads a portion of the table using a
WHERE id <= x and id > y clause in its SELECT query.
For this to kick in, a number of conditions need to be met, as described in
the documentation. The main interest might not be faster queries to fetch
the same overall data set, but better concurrency, with as many readers as
writers and each pair having its own dedicated queue.
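To make the range splitting concrete, here is a Python sketch that cuts an id interval into contiguous, non-overlapping WHERE clauses, one per reader. The function name and the even-sized split are assumptions for the illustration; how pgloader picks its boundaries is described in its documentation, not here.

```python
def range_predicates(min_id: int, max_id: int, readers: int) -> list[str]:
    """Split the ids min_id..max_id into at most `readers` contiguous ranges."""
    total = max_id - min_id + 1
    step = -(-total // readers)          # ceiling division
    predicates = []
    low = min_id - 1                     # exclusive lower bound
    while low < max_id:
        high = min(low + step, max_id)   # inclusive upper bound
        predicates.append(f"WHERE id > {low} AND id <= {high}")
        low = high
    return predicates


for clause in range_predicates(1, 10, 3):
    print(clause)
```

Using an exclusive lower bound and an inclusive upper bound guarantees that each row is read by exactly one reader.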
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches and the count of items in the reader
queue is now a number of rows, not of batches of them.
Anyway, with this patch in I can't reproduce the following issues:
Fixes #337, Fixes #420.
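The reader/writer arrangement described above can be sketched in Python with a bounded queue standing in for the lparallel queue. This is an illustration only: the names are invented, the "send" step just collects chunks instead of talking to PostgreSQL, and the COPY formatting ignores NULLs and escaping.

```python
import queue
import threading

DONE = object()  # sentinel telling a writer that no more rows will come


def reader(rows, row_queue, writer_count):
    """Push each row as soon as it is read; no batching on the reader side."""
    for row in rows:
        row_queue.put(row)               # blocks once prefetch_rows are queued
    for _ in range(writer_count):
        row_queue.put(DONE)


def writer(row_queue, batch_size, send):
    """Consume rows, format them as tab-separated lines, send them in batches."""
    batch = []
    while True:
        row = row_queue.get()
        if row is DONE:
            break
        batch.append("\t".join(str(col) for col in row) + "\n")
        if len(batch) >= batch_size:
            send("".join(batch))
            batch = []
    if batch:
        send("".join(batch))             # flush the last partial batch


def run(rows, writer_count=2, prefetch_rows=100, batch_size=3):
    row_queue = queue.Queue(maxsize=prefetch_rows)
    sent, lock = [], threading.Lock()

    def send(chunk):
        with lock:
            sent.append(chunk)

    writers = [threading.Thread(target=writer, args=(row_queue, batch_size, send))
               for _ in range(writer_count)]
    for w in writers:
        w.start()
    reader(rows, row_queue, writer_count)
    for w in writers:
        w.join()
    return sent


chunks = run([(i, f"name-{i}") for i in range(10)])
print(len(chunks), "chunks sent")
```

The queue size plays the role of the prefetch rows setting: it counts rows, not batches, and the batching work happens entirely on the writer side.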
pgloader has had support for PostgreSQL SET parameters (GUCs) from the
beginning. In the same vein, it might be necessary to tweak MySQL
connection parameters, so allow pgloader users to control them.
See #337 and #420 where net_read_timeout and net_write_timeout might need to
be set in order to be able to complete the migration, due to high volumes of
data being processed.
In PostgreSQL it is possible at CREATE TABLE time to set some extra storage
parameters, the most useful of them in the context of pgloader being the
FILLFACTOR. For the setting to be useful, it needs to be in place at
CREATE TABLE time, before we load the data.
The BEFORE LOAD clause of the pgloader command allows running SQL scripts
that are executed before the load, and even before the creation of the
target schema when pgloader does that, which is nice for other use cases.
Here we implement a new `ALTER TABLE` rule that one can set in the pgloader
command in order to change storage parameters at CREATE TABLE time:
ALTER TABLE NAMES MATCHING ~/\./ SET (fillfactor='40')
Fix #516.
It turns out that it's possible and not too complex, when using the
FreeTDS driver, to enforce the client encoding for MS SQL to be utf-8.
Document how to tweak ~/.freetds.conf to that end.
The code comment displayed in the release notes for 3.3.1 is reported to
be better at explaining the concurrency control than what we had in the
main documentation, so add it there.
Fix #496.
Also known as the ORM case, it happens that other tools are used to
create the target schema. In that case pgloader's job is to fill in the
existing target tables with the data from the source tables.
We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.
This behavior is activated when using the “create no tables” option as
in the following test-case setup:
with create no tables, include drop, truncate
Fixes #400, for which I got a test-case to play with!
This format of source file specifications is available for CSV, COPY and
FIXED formats but was only documented for the CSV one. The paragraph is
copy/pasted around in the hope of producing per-format man pages and web
documentation in a fully automated way at some point.
Fix #397.
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.
As this number might be so great as to exhaust the target PostgreSQL
server's resources (e.g. maintenance_work_mem), we add an option to limit
it to something reasonable when the source schema isn't.
Fix #386 in which 150 indexes are found on a single source table.
The new ALTER TABLE facility allows acting on tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes #224.
In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.
To that effect, tweak again the md-connection and md-copy
implementations.
Add the workers and concurrency settings to the LOAD commands for
database sources so that users can tweak them now, and add mentions of
them in the documentation too.
From the documentation string of the copy-from method as found in
src/sources/common/methods.lisp:
We allow WORKER-COUNT simultaneous workers to be active at the same time
in the context of this COPY object. A single unit of work consists of
several kinds of workers:
- a reader getting raw data from the COPY source with `map-rows',
- N transformers preparing raw data for PostgreSQL COPY protocol,
- N writers sending the data down to PostgreSQL.
The N here is set to the CONCURRENCY parameter: with a CONCURRENCY of
2, we start (+ 1 2 2) = 5 concurrent tasks, with a CONCURRENCY of 4 we
start (+ 1 4 4) = 9 concurrent tasks, of which only WORKER-COUNT may be
active simultaneously.
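The arithmetic in the docstring above, and the cap imposed by WORKER-COUNT, can be checked with a short Python sketch. A standard thread pool is swapped in here for illustration; it is not the mechanism pgloader uses, and the function names are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def task_count(concurrency: int) -> int:
    """One reader plus CONCURRENCY transformers plus CONCURRENCY writers."""
    return 1 + concurrency + concurrency


def peak_active(worker_count: int, concurrency: int) -> int:
    """Run task_count(concurrency) dummy tasks on a pool of worker_count
    threads and return the largest number seen running at the same time."""
    active = peak = 0
    lock = threading.Lock()

    def task(_):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)                 # stand-in for real work
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        list(pool.map(task, range(task_count(concurrency))))
    return peak


print(task_count(2), task_count(4))
print("peak with 4 workers:", peak_active(worker_count=4, concurrency=4))
```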
Those options should find their way into the remaining sources; that's
for a follow-up patch though.
Filter the list of tables we migrate directly in the SQLite query, so
that no useless data is returned. To do that, use the LIKE pattern
matching supported by SQLite, whereas the REGEXP operator is apparently
only available when extra features are loaded.
See #310 where filtering out the view still caused errors in the
loading.
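Here is a Python sketch of filtering the table list inside the SQLite catalog query itself, using LIKE patterns against sqlite_master. The list_tables helper and its including/excluding parameters are hypothetical names for this example; the query pgloader actually issues is not reproduced here.

```python
import sqlite3


def list_tables(conn, including=None, excluding=None):
    """Return table names, filtered by LIKE patterns inside the query itself."""
    sql = "SELECT name FROM sqlite_master WHERE type = 'table'"
    params = []
    if including:
        sql += " AND (" + " OR ".join("name LIKE ?" for _ in including) + ")"
        params += including
    if excluding:
        sql += " AND " + " AND ".join("name NOT LIKE ?" for _ in excluding)
        params += excluding
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);
    CREATE TABLE user_logs (id INTEGER);
    CREATE TABLE orders (id INTEGER);
    CREATE VIEW recent_orders AS SELECT * FROM orders;
""")
print(list_tables(conn, including=["user%"]))
```

Because the query restricts type to 'table', views never reach the caller, and the LIKE patterns are applied before any row is returned.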
It's now possible to use several files in a BEFORE LOAD EXECUTE section,
and to mix DO and EXECUTE parts, bringing lots of flexibility in the
commands. Also it actually simplifies the parser.
MySQL names its primary keys "PRIMARY" and we need to always uniquify
this name even when the user asked pgloader to preserve index names.
Also, the create-indexes-again function now needs to ask for index names
to be preserved specifically.
When loading against a table that already has index definitions, the
load can be quite slow. The previous commit introduced a warning in such
a case. This commit introduces the option "drop indexes", which is not
used by default.
When this option is used, pgloader drops the indexes before loading the
data, then creates the indexes again with the same definitions as before.
All the indexes are created again in parallel to optimize performance.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
See test/parse/hans.goeuro.load for an example usage of the new option.
In passing, any error when creating indexes is now properly reported and
logged, which was missing previously. Oops.
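The two-step primary key creation mentioned above maps onto standard PostgreSQL statements: build a unique index first, then attach it as the primary key with ALTER TABLE ... ADD PRIMARY KEY USING INDEX. The Python helper below only generates the SQL text; its name and arguments are invented for the illustration, identifiers are not quoted, and it does not claim to reproduce the statements pgloader emits.

```python
def recreate_index_sql(table, index_name, columns, primary=False, unique=False):
    """Return the SQL statements needed to build one index again."""
    cols = ", ".join(columns)
    if primary:
        # Step 1 can run in parallel with other index builds;
        # step 2 then promotes the finished index to a primary key.
        return [
            f"CREATE UNIQUE INDEX {index_name} ON {table} ({cols});",
            f"ALTER TABLE {table} ADD PRIMARY KEY USING INDEX {index_name};",
        ]
    kind = "UNIQUE INDEX" if unique else "INDEX"
    return [f"CREATE {kind} {index_name} ON {table} ({cols});"]


for statement in recreate_index_sql("film", "film_pkey", ["film_id"], primary=True):
    print(statement)
```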
This option is dangerous and allows skipping ALL triggers when loading
data against PostgreSQL. This includes foreign key constraint
definitions and will allow loading data out of order.
When using both the options "create no tables" and "disable triggers" it
will be possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.
Fix #167.
Also augment the documentation with examples of bare stdin reading and
of the advantages of unix pipes to stream even remote archived content
down to PostgreSQL.
Make it so that the following command line usages are accepted when
using pgloader without a command file:
./build/bin/pgloader ./test/sqlite/sqlite.db postgresql:///pgloader
./build/bin/pgloader --set "search_path='sakila'" \
mysql://root@localhost/sakila \
postgresql:///sakila
./build/bin/pgloader --type csv \
--field id --field field \
--with truncate \
--with "fields terminated by ','" \
./test/data/matching-1.csv \
postgres:///pgloader?matching
It's now possible in most cases to just use command-line options, which
should make the barrier to entry for pgloader much lower.