When in :data logging mode we log the whole data set as we read and then
write it, which is quite a lot of data. The current logging system works
by filling a queue that the cl-log library is then fed from, and pushing
that much data through the queue is very expensive, so stop doing that.
Hopefully we don't need to revisit the logging more than that; the other
messages should be few enough not to count for much when doing a full load.
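As a minimal sketch of the intended guard, assuming a simple switch rather
than pgloader's actual logging setup (*log-data-p* and log-data-row are
hypothetical names, and a cl-log log manager is assumed to be configured
already), the expensive :data payload is only formatted and queued when it
is explicitly asked for:

    (defvar *log-data-p* nil
      "When true, log every row as it is read and written (expensive).")

    (defun log-data-row (row)
      "Only format and enqueue the row when :data logging is wanted."
      (when *log-data-p*
        (cl-log:log-message :data "row: ~s" row)))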
Turns out that in some cases it's not possible to call format-vector-row on
MySQL result sets, because MySQL has been sending us a vector of bytes (a
blob) while the expected data (from the table definition) clearly is text.
Handle the error as an input reading error, skipping the line and being
verbose about it in the logs. This patch fails to update the stats about
what's happening, so it might need later changes.
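A sketch of that error handling, assuming a caller that formats one row at
a time; the process-row wrapper and the one-argument format-vector-row call
are illustrative, not the actual pgloader signatures:

    (defun process-row (row)
      "Return the formatted row, or NIL when it must be skipped."
      (handler-case
          (format-vector-row row)
        (error (condition)
          ;; treat it as an input reading error: log loudly, skip the row
          (cl-log:log-message :error "Skipping row, cannot format it: ~a"
                              condition)
          nil)))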
With the new internal setting *copy-batch-size* it's now possible to
instruct pgloader to close batches early (before the *copy-batch-rows* limit)
when crossing the byte count threshold.
When set to 20 MB it allows the new test case (exhausted) to pass under SBCL
and CCL, and in the testing done so far there's no measurable cost when
*copy-batch-size* is left at its default value of nil.
This patch is published without any way to tune these values from the command
language yet; that's the next step, once it's been proven effective.
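A sketch of the resulting batch-close test; batch-full-p and the
*copy-batch-rows* default shown here are assumptions, only the two special
variables and the nil default of *copy-batch-size* come from the patch
itself:

    (defparameter *copy-batch-rows* 25000   ; assumed value, for the sketch only
      "Maximum number of rows per COPY batch.")

    (defparameter *copy-batch-size* nil
      "When set to a number of bytes, close the batch early at that size.")

    (defun batch-full-p (row-count byte-count)
      "Close the batch on the row limit, or on the byte threshold when set."
      (or (<= *copy-batch-rows* row-count)
          (and *copy-batch-size*
               (<= *copy-batch-size* byte-count))))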
With this patch, the whole data massaging and final formatting into the
PostgreSQL COPY TEXT format is done by the reader thread, which publishes one
batch at a time to the communication channel: a lparallel.queue object.
Before that, the raw vectors were pushed directly into the queue, which
offered more flexibility to adjust to the reader and writer IO rates and
capabilities, but got in the way of the Garbage Collector: data still sitting
in the queue could not be collected even when it was no longer needed.
The new model also uses less memory and allows better control over how much
data stays in memory. The new *concurrent-batches* parameter
should be key to being able to process huge rows.
The intent is to offer users a way to tune *concurrent-batches*
down to 1 for sources with a massive per-row memory footprint. Even better
would be to find a way to automatically adjust the setting without spending
too much time counting the bytes we're batching.
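As a sketch of the channel setup, assuming a bounded lparallel.queue (the
*concurrent-batches* default value and the make-batch-channel name are
illustrative, not taken from the patch):

    (defparameter *concurrent-batches* 10   ; assumed default, for the sketch
      "How many formatted batches may wait in the queue at once.")

    (defun make-batch-channel ()
      "A bounded queue: push-queue blocks the reader once the writer lags
    *concurrent-batches* batches behind, capping the memory footprint."
      (lparallel.queue:make-queue :fixed-capacity *concurrent-batches*))

    ;; reader thread: (lparallel.queue:push-queue batch channel)
    ;; writer thread: (lparallel.queue:pop-queue channel)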
Preliminary tests show no noticeable performance impact from this patch, and
even some improvements in some cases.