When parsing table names in the target URI, we are careful to split the
schema and table names and store them as a cons in that case. Not all
source methods got the memo; clean that up.
See #182 and #186, a pull request I am now going to be able to accept.
Also see #287, which should be helped by being able to apply #186.
We used to parse qualified table names as a simple string, which breaks
attempts to be smart about how to quote identifiers. Some sources are
known to accept dots in quoted table names, and we need to be able to
process that properly without tripping on qualified table names too
late.
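A minimal sketch of the idea, with a hypothetical helper name, and
ignoring the quoted-identifier case the real code has to handle:

    ;; sketch: split an unquoted qualified name into (schema . table),
    ;; keep a plain string otherwise
    (defun parse-table-name (name)
      (let ((dot (position #\. name)))
        (if dot
            (cons (subseq name 0 dot) (subseq name (1+ dot)))
            name)))
    ;; (parse-table-name "public.foo") => ("public" . "foo")
    ;; (parse-table-name "foo")        => "foo"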
The current code might not be the best approach, as it just uses either
a cons or a string for table names internally rather than defining a
proper data structure with schema and name slots.
Well, that's for a later cleanup patch; I happen to be lazy tonight.
This option is dangerous: it allows skipping ALL triggers when loading
data into PostgreSQL, including the ones enforcing foreign key
constraints, and thus allows loading data out of order.
When using both the "create no table" and "disable triggers" options, it
becomes possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.
Fix #167.
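For reference, a sketch of what disabling triggers amounts to on the
PostgreSQL side (the exact statements pgloader issues may differ); note
that ALTER TABLE ... DISABLE TRIGGER ALL also disables the internal
triggers enforcing foreign keys:

    ;; sketch only, issued here through Postmodern; table name illustrative
    (postmodern:execute "ALTER TABLE public.foo DISABLE TRIGGER ALL")
    ;; ... COPY the data in ...
    (postmodern:execute "ALTER TABLE public.foo ENABLE TRIGGER ALL")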
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a uniform "connection"
concept (a class and a generic functions API) everywhere, so that the
COPY derived objects just use that in their :source-db and :target-db
slots. Given that, there is no more messing around with *pgconn*,
*myconn-* and other special variables anywhere in the tree.
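Roughly, and with illustrative names rather than the exact ones in the
tree, the shape of that API is:

    ;; sketch of the uniform connection API (names illustrative)
    (defclass connection ()
      ((handle :accessor conn-handle :initform nil)))

    (defgeneric open-connection (connection &key))
    (defgeneric close-connection (connection))

    ;; COPY derived objects then just carry two connection instances
    (defclass copy ()
      ((source-db :initarg :source-db :accessor source-db)
       (target-db :initarg :target-db :accessor target-db)))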
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when a new API got into place.
Third, fix any other oddities or missing parts found while doing the
first two; it was long overdue anyway...
When adding the CONTEXT message parsing I totally forgot that PostgreSQL
provides a nice error message translation capability. The code now copes
better with the situation, using a more advanced regular expression.
We could inline the known translations in the matching, but that would
be tedious to maintain, so we just use loose matching rules here.
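Something along these lines, where the pattern leans on punctuation and
digits rather than on the possibly-translated keywords (the helper name
is illustrative):

    ;; sketch: extract the line number without matching the localized
    ;; "COPY"/"line" keywords themselves
    (defun context-line-number (context)
      (cl-ppcre:register-groups-bind ((#'parse-integer line))
          ("^[^,]+,\\D*(\\d+)" context)
        line))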
When in :data logging mode we log the whole data set as we read and then
write it, which is quite a lot of data. Our current logging system works
by filling up a queue that the cl-log lib is then fed from, and sending
lots of data through that queue is way expensive, so stop doing that.
Hopefully we don't need to revisit the logging more than that; the other
messages should be few enough not to count for much when doing a full load.
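For context, a sketch of the logging arrangement described above, with
illustrative names and assuming an lparallel.queue for the channel; the
point is that every :data message means copying the row into that queue
first:

    ;; sketch (names illustrative): workers push log entries onto a
    ;; queue, a logger thread pops them and hands them to cl-log
    (defvar *log-queue* (lparallel.queue:make-queue))

    (defun enqueue-log (category fmt &rest args)
      ;; note: the logged data gets copied into the queue right here
      (lparallel.queue:push-queue
       (cons category (apply #'format nil fmt args)) *log-queue*))

    (defun logger-loop ()
      (loop for (category . message) = (lparallel.queue:pop-queue *log-queue*)
            do (cl-log:log-message category message)))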
With the new internal setting *copy-batch-size* it's now possible to
instruct pgloader to close batches early (before the *copy-batch-rows*
limit is reached) when crossing the byte count threshold.
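In effect the batch-closing test becomes something like this (a sketch;
the predicate name is illustrative):

    ;; sketch: close the current batch when either limit is reached;
    ;; a nil *copy-batch-size* (the default) disables the byte threshold
    (defun batch-full-p (row-count byte-count)
      (or (<= *copy-batch-rows* row-count)
          (and *copy-batch-size*
               (<= *copy-batch-size* byte-count))))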
When set to 20 MB it allows the new test case (exhausted) to pass under SBCL
and CCL, and in the testing done so far there's no measurable cost when
*copy-batch-size* is set to nil (its default value).
This patch is published without any way to tune the values from the command
language yet; that's the next step, once it's been proven effective.
With this patch, the whole data massaging and final formatting into the
PostgreSQL COPY TEXT format is done by the reader thread, which publishes
one batch at a time into the communication channel: an lparallel.queue
object.
Before that, the raw vectors were pushed directly into the queue, which
offered more flexibility to adjust to the reader and writer IO rates and
capabilities, but got in the way of the Garbage Collector: data still in
the queue was not collected even when it was not needed anymore.
The new model also uses less memory and allows better control over how
much data stays in memory. The new *concurrent-batches* parameter should
be key to being able to process huge rows.
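A sketch of the channel setup under that model, with illustrative names
(format-batch and write-batch stand for the massaging/formatting and
COPY steps mentioned above, and *concurrent-batches* is assumed to be
defined elsewhere):

    ;; sketch: a bounded queue holding at most *concurrent-batches*
    ;; batches; push-queue blocks the reader when the writer lags
    ;; behind, which is what bounds the memory footprint
    (defvar *batch-queue*
      (lparallel.queue:make-queue :fixed-capacity *concurrent-batches*))

    ;; reader thread: massage rows and publish a whole batch at a time
    (lparallel.queue:push-queue (format-batch rows) *batch-queue*)

    ;; writer thread: pop one COPY-formatted batch and send it
    (write-batch (lparallel.queue:pop-queue *batch-queue*))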
The intent is to offer a way for users to tune *concurrent-batches* down
to 1 for sources with a massive per-row memory footprint. Even better
would be to find a way to automatically adjust the setting without
spending too much time counting the bytes we're batching.
Preliminary tests show no noticeable performance impact from this patch,
and even some improvements in some cases.
This message has the line number where the erroneous data was found on the
server, and given the pre-processing we have already done at that point,
it's easy to convert that number into an index into the current batch,
which is an array.
To do so, we need Postmodern to expose the CONTEXT error message, and we
need to parse it. The following pull request takes care of the Postmodern
side of things:
https://github.com/marijnh/Postmodern/pull/46
The parsing is done as simply as possible, only assuming that the error
message uses comma separators and has the line number in second position.
The parsing as done here should still work with localized message strings.
CONTEXT: COPY errors, line 3, column b: "2006-13-11"
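Given a message like the one above, getting back to the offending row
then boils down to an index computation into the batch array (a sketch,
name illustrative):

    ;; sketch: the server reports 1-based line numbers, the batch is a
    ;; 0-based array of already formatted rows
    (defun batch-row-for-context-line (batch line-number)
      (aref batch (1- line-number)))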
This change should significantly reduce the cost of error processing.