Default values in SQLite of course use their SQLite representation,
which might not be compatible with the PostgreSQL target data type
we're casting to. Make it so that the default values are transformed
too, as we already do in the MySQL case.
See #100.
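A minimal sketch of the idea in Common Lisp, where column-default and
column-transform are hypothetical accessors standing in for pgloader's
real slot readers:

    ;; Run the column's cast transform over its default value too, so
    ;; that a SQLite default lands in a representation the PostgreSQL
    ;; target data type accepts.
    (defun transform-default (column)
      (let ((default   (column-default column))
            (transform (column-transform column)))
        (if (and default transform)
            (funcall transform default)
            default)))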
In SQLite it's possible to define columns using type names such as
"smallint unsigned" or "short integer", without any changes to the way
those data types are handled, given its "dynamic typing" features.
Improve the pgloader casting machinery for SQLite to handle those cases.
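A minimal sketch of the kind of normalization involved, assuming the
split-sequence library; the token list is illustrative, not pgloader's
actual cast rules:

    ;; Reduce a composite SQLite type name such as "smallint unsigned"
    ;; or "short integer" to a base type the cast rules know about.
    (defun normalize-sqlite-type (type-name)
      (let ((tokens (split-sequence:split-sequence
                     #\Space (string-downcase type-name)
                     :remove-empty-subseqs t)))
        (cond ((member "integer"  tokens :test #'string=) "integer")
              ((member "smallint" tokens :test #'string=) "smallint")
              (t (first tokens)))))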
The TRUNCATE command is only sent to PostgreSQL when we didn't just
CREATE TABLE before. Some refactoring would be necessary to fit the
TRUNCATE command within the same transaction as the CREATE TABLE
command, for PostgreSQL performance.
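A minimal sketch of that control flow, using Postmodern directly;
pgloader's own helpers differ, and created-p and truncate-p are
illustrative flags:

    ;; Skip the TRUNCATE when the target table was just created: it is
    ;; empty already, and issuing TRUNCATE outside of the CREATE TABLE
    ;; transaction costs an extra round-trip and lock.
    (defun maybe-truncate (table-name &key created-p truncate-p)
      (when (and truncate-p (not created-p))
        (postmodern:execute
         (format nil "TRUNCATE TABLE ~a;" table-name))))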
This patch has been tested with MySQL and SQLite sources. The trick is
that to be able to test it, it's necessary to first make a full
import (creating the target tables), so the tests are not modified yet.
When using SQLite 3, a blob column might return either string or byte
vector values dynamically, depending on the data itself or maybe on
some more complex parameters controlled at data insert time.
Hard-code the rule that a blob column returned as a string is in fact
base64 encoded (which looks like common practice) and decode it
automatically when needed, before sending it to byte-vector-to-bytea.
It might be a tad slow, but at least the data is properly converted.
In the future, that decision might come back to byte us, at which
point it'll be necessary to consider full casting options as in the
MySQL CAST rules. It seems like a big enough win for now if we can
avoid that.
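A minimal sketch of that rule, assuming the cl-base64 library;
pgloader's actual transform may differ in the details:

    ;; SQLite may hand a blob back either as a byte vector or as a
    ;; string; when it's a string, assume base64 and decode it before
    ;; formatting the bytes as PostgreSQL bytea.
    (defun blob-to-bytea (value)
      (byte-vector-to-bytea
       (if (stringp value)
           (base64:base64-string-to-usb8-array value)
           value)))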
This issue has been re-opened with blob instead of double. Semi-blindly
implement support for the blob type, here declared with an image data
type. Disturbingly enough, when tested with non-binary data, SQLite was
returning strings rather than byte vectors, tripping up the transform
function, which does expect byte vectors.
I could get down to the problem here, which is that a couple of indexes
were reported to pgloader without any SQL definition for them, and
pgloader would then wait for non-existing tasks.
It seems easier to just skip those indexes, and that's what this patch
does.
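Illustratively, the skip amounts to a simple filter, where index-sql is
a hypothetical accessor for the reported SQL definition:

    ;; Drop indexes that SQLite reports without a SQL definition
    ;; (typically auto-created ones), so that pgloader never schedules,
    ;; nor waits on, a task for them.
    (defun keep-only-defined-indexes (indexes)
      (remove-if-not #'index-sql indexes))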
With this patch, the whole data massaging and final formatting into the
PostgreSQL COPY TEXT format is done by the reader thread, which
publishes a batch at a time into the communication channel: an
lparallel.queue object.

Before that, the raw vectors were pushed directly into the queue,
offering more flexibility to adjust to the reader and writer IO rates
and capabilities, but impeding the Garbage Collector: data still in the
queue could not be collected even when no longer needed.
The new model also uses less memory and allows better control over how
much data stays in memory. The new *concurrent-batches* parameter
should be key to being able to process huge rows.
The intent is to offer a way for users to tune *concurrent-batches*
down to 1 for sources with a massive per-row memory footprint. Even
better would be to find a way to automatically adjust the setting
without spending too much time counting the bytes we're batching.
Preliminary tests show no noticeable impact on performance from this
patch, and even some improvements in some cases.
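A minimal sketch of the new channel shape, using lparallel's documented
fixed-capacity queue; everything beyond the lparallel.queue API is
illustrative:

    (defparameter *concurrent-batches* 2
      "How many formatted batches may wait in the channel at once.")

    ;; push-queue blocks once the queue is full, so the reader thread
    ;; throttles itself to the writer's pace, and the Garbage Collector
    ;; can reclaim each batch as soon as the writer is done with it.
    (defun make-batch-channel ()
      (lparallel.queue:make-queue :fixed-capacity *concurrent-batches*))

    ;; reader side: (lparallel.queue:push-queue batch channel)
    ;; writer side: (lparallel.queue:pop-queue channel)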
Revert "Shorten column names in the application to bypass a postmodern bug (or something)."
This reverts commit 240574a1a5f71edefc19a4b0f35f37862bdfeacc.