In passing, refactor the *pgconn- dynamic bindings in favor of using the
connection property list straight from the connection string parser,
processing it when necessary. That makes it simple to add an internal
:use-ssl property.
First, index names in MS SQL, as in MySQL, only need to be unique per
table, whereas they need to be globally unique (per schema) in
PostgreSQL. So reuse the infrastructure we already had for MySQL here.
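A minimal sketch of the idea, with a made-up helper name and naming scheme
(the actual code reuses the existing MySQL machinery):

    (defun build-pgsql-index-name (table-name index-name)
      "Make INDEX-NAME unique per schema by prefixing it with TABLE-NAME."
      (format nil "idx_~a_~a" table-name index-name))

    ;; (build-pgsql-index-name "orders" "ix_customer")
    ;; => "idx_orders_ix_customer"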
Second, the way we trick table names into the index and fkey structures
means we have already quoted the names and we don't want to quote them
again, so add a new possible *identifier-case* value to handle the case
where nothing is to be done, pretty please.
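A minimal sketch of what that new value looks like, assuming it is spelled
:none (the name and the existing values are illustrative here):

    (defvar *identifier-case* :downcase
      "How to map identifiers to PostgreSQL: :downcase, :quote or :none.")

    (defun apply-identifier-case (identifier)
      "Return IDENTIFIER as it should be sent to PostgreSQL."
      (ecase *identifier-case*
        (:downcase (string-downcase identifier))
        (:quote    (format nil "\"~a\"" identifier))
        ;; new case: the name has already been quoted upstream, leave it be
        (:none     identifier)))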
Rather than doing ALTER TABLE directly, use CREATE UNIQUE INDEX in the
all-in-parallel concurrent index build per table, and only in the end
game "upgrade" that unique index into a PRIMARY KEY using ALTER TABLE.
Doing it that way avoids taking an ACCESS EXCLUSIVE lock at ALTER TABLE
time, which would otherwise kill our index build concurrency.
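A sketch of the two steps, with made-up table and index names (the
ALTER TABLE ... ADD PRIMARY KEY USING INDEX form needs PostgreSQL 9.1
or later):

    (defun primary-key-statements (table-name index-name columns)
      "Return the SQL for the concurrent build then the final upgrade."
      (list (format nil "CREATE UNIQUE INDEX ~a ON ~a (~{~a~^, ~});"
                    index-name table-name columns)
            (format nil "ALTER TABLE ~a ADD PRIMARY KEY USING INDEX ~a;"
                    table-name index-name)))

    ;; (primary-key-statements "foo" "foo_pkey_idx" '("id"))
    ;; => ("CREATE UNIQUE INDEX foo_pkey_idx ON foo (id);"
    ;;     "ALTER TABLE foo ADD PRIMARY KEY USING INDEX foo_pkey_idx;")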
That patch is not a principled approach to fixing the problem, but it
should avoid messing up fully qualified table names.
A proper way to do it would be to have a pgsql object name structure
composed of the catalog, the schema and the name as separate entries,
with an accompanying API to print that object properly. That's for another day
though.
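For illustration only, since it's not part of this patch, such a structure
could look like the following sketch:

    (defstruct pgsql-object-name catalog schema name)

    (defun print-qualified-name (object)
      "Print OBJECT as a quoted, schema-qualified name."
      (format nil "~@[\"~a\".~]\"~a\""
              (pgsql-object-name-schema object)
              (pgsql-object-name-name object)))

    ;; (print-qualified-name
    ;;  (make-pgsql-object-name :schema "public" :name "foo"))
    ;; => "\"public\".\"foo\""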
When adding the CONTEXT message parsing I totally forgot that PostgreSQL
provides a nice error message translation capability. The code now copes
better with the situation, using a more advanced regular expression.
We could inline the known translations in the matching, but that would
be tedious to maintain, so we just use loose matching rules here.
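For illustration, a loose-matching sketch (not the exact regular expression
used): rely only on the comma separators and the digits, never on the
possibly translated words around them.

    (defun parse-context-line-number (context)
      "Return the line number found in a COPY CONTEXT error message."
      (cl-ppcre:register-groups-bind ((#'parse-integer line-number))
          ("^[^,]+,\\D*(\\d+)" context)
        line-number))

    ;; (parse-context-line-number "CONTEXT: COPY errors, line 3, column b: ...")
    ;; => 3, and a translated message parses the same way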
When in :data logging mode we log the whole data set as we read and then
write it, which is quite a lot of data. Our current logging system works
by filling up a queue that the cl-log lib is then fed from, and sending
lots of data through that queue is quite expensive, so stop doing that.
Hopefully we don't need to revisit the logs more than that; the other
messages should be few enough not to matter much when doing a full load.
The code completely forgot that MySQL column name references in foreign
key definitions have to follow the identifier case rules; this patch
fixes that.
To be able to do that, we need to parse the GROUP_CONCAT() result that
lists the FK columns, as there apparently are no arrays in MySQL. The
problem here is that almost any character is allowed in column names when
`quoted`, so splitting on a comma might prove fragile later.
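A hedged sketch of that parsing, reusing the apply-identifier-case function
sketched earlier; the comma split is exactly the fragile part mentioned
above:

    (defun parse-fk-columns (group-concat-result)
      "Split the GROUP_CONCAT() column list and apply identifier case rules."
      (mapcar (lambda (name)
                (apply-identifier-case (string-trim "` " name)))
              (cl-ppcre:split "," group-concat-result)))

    ;; (parse-fk-columns "customer_id,Order_Date")
    ;; => ("customer_id" "order_date") when *identifier-case* is :downcase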
The truncate command is only sent to PostgreSQL when we didn't just
CREATE TABLE beforehand. Some refactoring would be necessary to fit the
TRUNCATE command within the same transaction as the CREATE TABLE
command, for better PostgreSQL performance.
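A sketch of that rule, with hypothetical flag names:

    (defun maybe-truncate-sql (table-name &key just-created)
      "Return a TRUNCATE statement unless we just created the table."
      (unless just-created
        (format nil "TRUNCATE ~a;" table-name)))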
This patch has been tested with MySQL and SQLite sources. The trick is
that testing it requires first doing a full import (creating the target
tables), so the tests are not modified yet.
The patch from pull request #30 was hard-coding the PostgreSQL side
quoting; we now use the quote_ident() function instead, as it's available
in every PostgreSQL production release (8.4 included).
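A sketch of the idea, assuming an open Postmodern connection:

    ;; let the server apply its own quoting rules
    (postmodern:query "select quote_ident($1)" "My Table" :single)
    ;; => "\"My Table\""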
With the new internal setting *copy-batch-size* it's now possible to
instruct pgloader to close batches early (before the *copy-batch-rows*
limit) when crossing the byte count threshold.
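A minimal sketch of the rule, with an illustrative row limit:

    (defvar *copy-batch-rows* 25000 "Illustrative row limit.")
    (defvar *copy-batch-size* nil   "Byte threshold, nil disables it.")

    (defun batch-full-p (row-count byte-count)
      "Close the batch at the row limit, or earlier on the byte threshold."
      (or (<= *copy-batch-rows* row-count)
          (and *copy-batch-size*
               (<= *copy-batch-size* byte-count))))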
When set to 20 MB it allows the new test case (exhausted) to pass under SBCL
and CCL, and in the testing done there's no measurable cost when
*copy-batch-size* is set to nil (its default value).
This patch is published without any way to tune the values from the command
language yet; that's the next step, once it's been proven effective.
With this patch, the whole data massaging and final formatting into the
PostgreSQL COPY TEXT format is done by the reader thread, which publishes
one batch at a time into the communication channel: a lparallel.queue object.
Before that, the raw vectors were pushed directly into the queue, offering
more flexibility to adjust to the reader and writer IO rates and
capabilities, but getting in the way of the Garbage Collector: data still
in the queue could not be collected even when it was not needed anymore.
The new model also uses less memory and allows better control over how
much data stays in memory. The new *concurrent-batches* parameter
should be key to being able to process huge rows.
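A sketch of the mechanism, assuming *concurrent-batches* is used as the
queue capacity (the default value here is illustrative):

    (defvar *concurrent-batches* 10 "Illustrative default.")

    (defun make-batch-queue ()
      "A bounded queue: the reader blocks once it gets too far ahead."
      (lparallel.queue:make-queue :fixed-capacity *concurrent-batches*))

    ;; reader thread:  (lparallel.queue:push-queue batch queue)
    ;; writer thread:  (lparallel.queue:pop-queue queue)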
The intent is to offer a way for users to tune *concurrent-batches*
down to 1 for sources with massive per-row memory footprint. Even better
would be to find a way to automatically adjust the setting without spending
too much time counting the bytes we're batching.
Preliminary tests show no noticeable impact on performance from this patch,
and even some improvements in some cases.
This message has the line number where the erroneous data was found on the
server, and given the pre-processing we have already done at that point,
it's easy to convert that number into an index into the current batch,
which is an array.
To do so, we need Postmodern to expose the CONTEXT error message and we
need to parse it. The following pull request takes care of the Postmodern
side of things:
https://github.com/marijnh/Postmodern/pull/46
The parsing is done as simply as possible, only assuming that the error
message uses comma separators and has the line number in the second
position. The parsing as done here should still work with localized
message strings.
CONTEXT: COPY errors, line 3, column b: "2006-13-11"
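A sketch of that parsing and of the conversion into a batch index (the
server reports a 1-based line number, the batch is a 0-based array):

    (defun batch-index-from-context (context)
      "Return a zero-based index into the current batch, or nil."
      (let* ((second-field (second (cl-ppcre:split "," context)))
             (digit-start  (and second-field
                                (position-if #'digit-char-p second-field)))
             (line-number  (and digit-start
                                (parse-integer second-field
                                               :start digit-start
                                               :junk-allowed t))))
        (when line-number (1- line-number))))

    ;; (batch-index-from-context
    ;;  "CONTEXT: COPY errors, line 3, column b: \"2006-13-11\"")
    ;; => 2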
This change should significantly reduce the cost of error processing.
Revert "Shorten column names in the application to bypass a postmodern bug (or something)."
This reverts commit 240574a1a5f71edefc19a4b0f35f37862bdfeacc.