After all, it's shared between the CSV command parsing and the Cast
Rules parsing. src/parsers/command-csv.lisp still contains lots of
facilities shared between the file based sources, will need another
series of splits.
Commit 598c860cf5 broke user defined
casting rules by interning "precision" and "scale" in the
pgloader.user-symbols package: those symbols need to be found in the
pgloader.transforms package instead.
Luckily enough the infrastructure to do that was already in place for
cl:nil.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that get passed log messages so that the
output buffer is not subject to race conditions, extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring
this fixes#283 wherein the before/after sections of individual CSV
files within an ARCHIVE command where not counted in the reporting.
To be able to use "t" (or "nil") as a column name, pgloader needs to be
able to generate lisp code where those symbols are available. It's
simple enough in that a Common Lisp package that doesn't :use :cl
fullfills the condition, so intern user symbols in a specially crafted
package that doesn't :use :cl.
Now, we still need to be able to run transformation code that is using
the :cl package symbols and the pgloader.transforms functions too. In
this commit we introduce a heuristic to pick symbols either as functions
from pgloader.transforms or anything else in pgloader.user-symbols.
And so that user code may use NIL too, we provide an override mechanism
to the intern-symbol heuristic and use it only when parsing user code,
not when producing Common Lisp code from the parsed load command.
When the notion of a connection class with a generic set of method was
invented, the very flexible specification formats available for the file
based sources where not integrated into the new connection system.
This patch provides a new connection class md-connection with a specific
sub-protocol (after opening a connection, the caller is supposed to loop
around open-next-stream) so that it's possible to both properly fit into
the connection concept and to better share the code in between our three
implementation (csv, copy, fixed).
The dry run option will currently only check database connections, but
as that happens after having correctly parsed the load file, it allows
to also check that the command file is correct for the parser.
Note that the list load-data API isn't subject to the dry-run method.
In passing, we add some more API entry points to the connection objects
and we should actually clean the code base to use the new QUERY generic
all over the place. It's for another patch tho.
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but failed to properly account for that fact in the
summary report with timings.
While fixing this, also fix the SQL used to re-establish the indexes and
associated constraints to allow for parallel execution, the ALTER TABLE
statements would block in ACCESS EXCLUSIVE MODE otherwise and make our
efforts vain.
When loading against a table that already has index definitions, the
load can be quite slow. Previous commit introduced a warning in such a
case. This commit introduces the option "drop indexes" that is not used
by default.
When this option is used, pgloader drops the indexes before loading the
data then create the indexes again with the same definitions as before.
All the indexes are created again in parallel to optimize performances.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
Some CSV files are using the CSV escape character internally in their
fields. In that case we enter a parsing bug in cl-csv where backtracking
from parsing the escape string isn't possible (or at least
unimplemented).
To handle the case, change the quote parameter from \" to just \ and let
cl-csv use its escape-quote mechanism to decide if we're escaping only
separators or just any data.
See https://github.com/AccelerationNet/cl-csv/issues/17 where the escape
mode feature was introduced for pgloader issue #80 already.
To allow for importing JSON one-liners as-is in the database it can be
interesting to leverage the CSV parser in a compatible setup. That setup
requires being able to use any separator character as the escape
character.
Some CSV files are given with an header line containing the list of
their column names, use that when given the option "csv header".
Note that when both "skip header" and "csv header" options are used,
pgloader first skip as many required lines and then uses the next one as
the csv header.
Because of temporary failure to install the `ronn` documentation tool,
this patch only commits the changes to the source docs and omits to
update the man page (pgloader.1). A following patch is intended to be
pushed that fixed that.
See #236 which is using shell tricks to retrieve the field list from the
CSV file itself and motivated this patch to finally get written.
We used to parse qualified table names as a simple string, which then
breaks attempts to be smart about how to quote idenfifiers. Some sources
are known to accept dots in quoted table names and we need to be able to
process that properly without tripping on qualified table names too
late.
Current code might not be the best approach as it's just using either a
cons or a string for table names internally, rather than defining a
proper data structure with a schema and a name slot.
Well, that's for a later cleanup patch, I happen to be lazy tonight.
This option is dangerous and allows to skip ALL triggers when loading
data against PostgreSQL. This includes foreign key constraints
definitions and will allow loading data out of order.
When using both the options "create no table" and "disable triggers" it
will be possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.
Fix#167.
In passing also allow --field to specify the whole field list, there's
no point in forcing the user to have as many --field switches on the
command line as they have columns in their data source file.
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a uniformed "connection"
concept (class and generic functions API) everywhere, so that the COPY
derived objects just use that in their :source-db and :target-db slots.
Given that, we don't need no messing around with *pgconn* and *myconn-*
and other special variables at all anywhere in the tree.
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when new API got into place.
Third, fix any other oddity or missing part found while doing those
first two activities, it was long overdue anyway...
Make it so that the following command line usages are accepted when
using pgloader without a command file:
./build/bin/pgloader ./test/sqlite/sqlite.db postgresql:///pgloader
./build/bin/pgloader --set "search_path='sakila'" \
mysql://root@localhost/sakila \
postgresql:///sakila
./build/bin/pgloader --type csv \
--field id --field field \
--with truncate \
--with "fields terminated by ','" \
./test/data/matching-1.csv \
postgres:///pgloader?matching
It's now possible in most cases to just use command-line options, which
should make the entry bar to pgloader much lower.