It's easy to avoid the warning about an unused lexical variable with the
proper declaration, which I failed to install before because of a syntax
error when I tried. Let's fix it now that I realise what was wrong.
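For reference, a minimal sketch of the kind of declaration involved, with
hypothetical function and variable names (the actual spot in the pgloader
source differs):

    ;; without the IGNORE declaration the compiler warns about the
    ;; unused lexical variable EXTRA
    (defun example-callback (row extra)
      (declare (ignore extra))          ; the declaration that was missing
      (process-row row))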
Understanding the timings requires not only the number of rows copied into
each table but also how many bytes those rows represent. We now add that
information to the output.
The number of bytes presented is computed from the Unicode representation
we prepare in pgloader for each row before sending it down to PostgreSQL.
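As a rough sketch of that byte accounting, assuming the row has already
been rendered to a string and that a library such as babel handles the
encoding (the function name here is hypothetical):

    ;; count the bytes a row will occupy once encoded, before it is sent
    ;; down the COPY stream to PostgreSQL
    (defun row-bytes (row-string)
      (length (babel:string-to-octets row-string :encoding :utf-8)))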
Use a generic function protocol to implement the human readable, verbose,
csv, copy and json reporting output formats. This is much cleaner and more
extensible than the previous approach.
Use that new power to implement a real JSON output from the internal state
object.
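A sketch of what such a generic function protocol can look like,
dispatching on an EQL-specialized format argument; the accessors and the
use of yason for the JSON part are assumptions, not the actual pgloader
code:

    (defgeneric report-summary (state format stream)
      (:documentation "Print a summary of STATE to STREAM in FORMAT."))

    (defmethod report-summary (state (format (eql :human-readable)) stream)
      (format stream "~&~a rows imported in ~a~%"
              (state-rows state) (state-seconds state)))

    (defmethod report-summary (state (format (eql :json)) stream)
      ;; serialize the internal state object directly instead of
      ;; printing a table and massaging it afterwards
      (yason:encode (state-to-hash-table state) stream))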
The pgstate infrastructure already had lots of details about what's going
on. Add to it the number of bytes sent in every batch, and use that
information in the monitor, when something long is happening, to display
how many rows we have sent so far for this (supposedly) huge table, along
with the bytes and the speed (bytes per second).
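The throughput figure itself is a simple derivation from those counters; a
hedged sketch with hypothetical names:

    ;; given the running totals the monitor keeps for a table, display
    ;; rows, bytes and throughput while a long copy is in flight
    (defun report-progress (table-name rows bytes seconds)
      (format t "~&copying ~a: ~d rows, ~d bytes, ~,1f bytes/s~%"
              table-name rows bytes
              (if (zerop seconds) 0 (/ bytes seconds))))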
The concurrent nature of pgloader made it non-obvious where to implement
the timers properly, and as a result the tracking of how long it took to
actually transfer the data was... just wrong.
Rather than trying to measure the time spent in any particular piece of
the code, we now emit "start" and "stop" stats messages to the monitor
thread at the right places (which are much easier to find, in the worker
threads) and have the monitor figure out how long it really took.
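A minimal sketch of that message flow, with hypothetical names on both
sides (pgloader's actual messages and monitor code differ):

    ;; worker side: signal the boundaries and let the monitor do the math
    (send-event monitor (list :start table-name (get-internal-real-time)))
    ;; ... stream the data to PostgreSQL ...
    (send-event monitor (list :stop table-name (get-internal-real-time)))

    ;; monitor side: the elapsed time is the difference between the two
    ;; timestamps, in seconds
    (defun handle-timing-event (event start-times)
      (destructuring-bind (kind table timestamp) event
        (ecase kind
          (:start (setf (gethash table start-times) timestamp))
          (:stop  (/ (- timestamp (gethash table start-times))
                     internal-time-units-per-second)))))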
Fix #506.
Now that we have fixed the output of the per-table total timing, we show
only that timing by default. With more verbosity pgloader adds the extra
columns, and in computer-oriented formats (json, csv, copy) all the details
are of course always provided.
See #506.
In order to share more code between the different source types, finally
have a go at the quite horrible mess of anonymous data structures floating
around.
Having catalog and schema instances not only allows for code cleanup, but
will also allow us to implement some bug fixes and wishlist items such as
mapping tables from one schema to another.
Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting on par
with MySQL in the MS SQL support (materialized views have been asked for
already).
See #320, #316, #224 for references and a notion of progress being made.
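To make the idea concrete, a sketch of what catalog and schema instances
might look like; the slot names are illustrative, not the actual pgloader
definitions:

    (defstruct catalog name schema-list)
    (defstruct schema  name catalog table-list)
    (defstruct table   name schema field-list column-list)

    ;; mapping a table from one schema to another becomes a couple of
    ;; slot updates rather than surgery on anonymous nested lists
    (defun move-table (table target-schema)
      (let ((source-schema (table-schema table)))
        (setf (schema-table-list source-schema)
              (remove table (schema-table-list source-schema)))
        (setf (table-schema table) target-schema)
        (push table (schema-table-list target-schema))))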
In passing, also clean up the copy-databases methods for the database
source types, so that they all use the fetch-metadata,
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.
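Put together, the single method amounts to a driver over those generic
functions; a hedged sketch with simplified signatures and hypothetical
helper names:

    (defgeneric fetch-metadata (copy catalog)
      (:documentation "Introspect the source and fill in CATALOG."))

    ;; the single method that drives a whole database migration
    (defmethod copy-database ((copy db-copy) &key)
      (let ((catalog (make-catalog)))
        (fetch-metadata copy catalog)            ; list-all-columns and friends
        (apply-casting-rules catalog)            ; the explicit CAST step
        (prepare-pgsql-database copy catalog)    ; create schemas and tables
        (copy-data copy catalog)                 ; stream the rows over COPY
        (complete-pgsql-database copy catalog))) ; indexes, constraints, etc.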
Finally, the fields/columns/transforms slots in the copy objects are still
used by the operative code, so the internal catalog representation is only
used up to the start of the data copy step, at which point the copy class
instances are all that's used. This might be refactored again in a
follow-up patch.
Add metrics to determine where the time is spent in the current pgloader
code, so that it's then possible to optimize away the batch processing as
we do it today.
Given the following extract of the measures, it seems that doing the data
transformations in the reader thread isn't such a bright idea. More to
come.
       table name      total time       read      write
-----------------  --------------  ---------  ---------
          extract          2.014s
      before load          0.050s
            fetch          0.000s
-----------------  --------------  ---------  ---------
 geolite.location         16.090s    15.933s     5.732s
   geolite.blocks         28.896s    28.795s     5.312s
-----------------  --------------  ---------  ---------
       after load         37.772s
-----------------  --------------  ---------  ---------
Total import time       1m25.082s    44.728s    11.044s
In order to later be able to have more worker threads sharing the load
(multiple readers and/or writers, maybe more specialized threads too), have
all the stats managed centrally by a single thread. We already have a
"monitor" thread that gets passed log messages so that the output buffer is
not subject to race conditions; extend its use to also deal with statistics
messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
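A sketch of that centralisation, assuming an lparallel queue carries the
messages; the message shape and the stats containers are illustrative
only:

    ;; workers push small messages to the monitor queue instead of
    ;; touching shared counters directly
    (defparameter *monitor-queue* (lparallel.queue:make-queue))

    (defun update-stats (table-name &key (rows 0) (bytes 0))
      (lparallel.queue:push-queue (list :update table-name :rows rows :bytes bytes)
                                  *monitor-queue*))

    ;; the monitor thread is the only writer to the stats, so no locks
    ;; are needed around the counters themselves
    (defun monitor-loop (rows-table bytes-table)
      (loop for message = (lparallel.queue:pop-queue *monitor-queue*)
            do (destructuring-bind (kind table-name &key (rows 0) (bytes 0))
                   message
                 (declare (ignore kind))
                 (incf (gethash table-name rows-table 0) rows)
                 (incf (gethash table-name bytes-table 0) bytes))))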
Also, as a nice side effect of the code simplification and refactoring,
this fixes #283, wherein the before/after sections of individual CSV files
within an ARCHIVE command were not counted in the reporting.
We used to parse qualified table names as a simple string, which then
breaks attempts to be smart about how to quote identifiers. Some sources
are known to accept dots in quoted table names, and we need to be able to
process that properly without tripping on qualified table names too late.
The current code might not be the best approach, as it just uses either a
cons or a string for table names internally rather than defining a proper
data structure with schema and name slots. Well, that's for a later cleanup
patch; I happen to be lazy tonight.
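Still, a hedged sketch of how the cons-or-string representation can be
quoted safely, with hypothetical helper names; the later cleanup would
replace the cons with a structure carrying schema and name slots:

    ;; double-quote one identifier, doubling embedded quotes as SQL requires
    (defun quote-ident (name)
      (format nil "\"~a\""
              (with-output-to-string (s)
                (loop for char across name
                      do (when (char= char #\") (write-char #\" s))
                         (write-char char s)))))

    ;; a string is an unqualified name, a cons is (schema . table)
    (defun format-table-name (table-name)
      (etypecase table-name
        (string (quote-ident table-name))
        (cons   (format nil "~a.~a"
                        (quote-ident (car table-name))
                        (quote-ident (cdr table-name))))))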