Commit Graph

34 Commits

adrian
3da4422fb5 Fix a few typos in comments/strings 2023-05-01 10:18:30 +02:00
lukesilvia
48d8ed0613
fix: ranged load does not load last record. (#1203) 2020-08-31 20:12:32 +02:00
Dimitri Fontaine
11970bbca8
Implement tables row count ordering for MySQL. (#1120)
This should help optimise the duration of migrating databases with very big
tables and lots of smaller ones. It might be a little too naive as far as
the optimisation goes, while still being an improvement on the default
alphabetical ordering.

Fixes #1099.
2020-04-04 16:40:53 +02:00
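
A minimal sketch of the scheduling idea in Common Lisp; the pair-list
representation and function name are illustrative, not pgloader's actual
code:

  ;; Sort tables so the biggest ones (by MySQL's row-count estimate)
  ;; migrate first.
  (defun sort-tables-by-row-count (tables)
    "TABLES is a list of (table-name . row-count) pairs."
    (sort (copy-list tables) #'> :key #'cdr))

  ;; the row-count estimates would come from a catalog query such as:
  ;;   select table_name, table_rows
  ;;     from information_schema.tables
  ;;    where table_schema = 'dbname'
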
Dimitri Fontaine
b8da7dd2e9
Generic Function API for Materialized Views support. (#970)
Implement a generic-function API to discover the source database schema
and populate pgloader's internal version of the catalogs. Cut three
copies of roughly the same code path down to a single shared one, thanks
to applying some amount of OOP to the code.
2019-05-20 19:28:38 +02:00
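
A hedged sketch of what such a protocol can look like; fetch-metadata is
named by the commits themselves, while the lambda lists and the second
generic function are assumptions:

  (defgeneric fetch-metadata (copy catalog &key materialized-views)
    (:documentation
     "Introspect the source database and fill in CATALOG; each source
      type (MySQL, MS SQL, SQLite, ...) provides its own method."))

  (defgeneric create-matviews (views connection)
    (:documentation
     "Create the materialized views on the source side before discovery."))
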
Dimitri Fontaine
4b9cbcbce3 Fix MySQL processing of Mat Views, again.
The previous fix left something to be desired in that it didn't update
the code's expectations about what a view-name looks like in
fetch-metadata. In particular we would use (cons NIL "table-name") in
the only-table facility, which expects non-qualified names as strings.

Switch to using the :including filters facility instead, as we do in MS SQL.
Later, we might want to deprecate our internal :only-tables facility.

Fix #932. Again.
2019-04-15 23:06:57 +02:00
Dimitri Fontaine
1a4ce4fb46 Re-indent block before editing.
It turns out that the current Emacs settings don't agree with the
file's indenting; let's clean that up before modifying the file, so that
later reviews are easier.
2019-04-15 23:05:28 +02:00
Dimitri Fontaine
0f58a3c84d Assorted fixes: catalog SQLtypes and MySQL DECODING AS.
It turns out that when trying to debug "decoding as", the SQLtype listing
support in sqltype-list was found broken, so this patch fixes it. It then
goes on to fix the DECODING AS filters support, which we had switched to
the better regexp-or-string filter struct while forgetting to update the
matching code accordingly.

Fixes #665.
2018-08-31 22:51:41 -07:00
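
A sketch of the kind of matching code that needed the update, assuming a
filter that is either a plain string or a (:regex "...") form; the
representation is illustrative:

  (defun filter-matches-p (filter type-name)
    "Match TYPE-NAME against FILTER, a string or a (:regex ...) form."
    (etypecase filter
      (string (string-equal filter type-name))
      (cons   (when (eq (first filter) :regex)
                (and (cl-ppcre:scan (second filter) type-name) t)))))
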
Dimitri Fontaine
5c10f12a07 Fix duplicate package names.
In a previous commit we re-used the package name pgloader.copy for the now
separated implementation of the COPY protocol, but this package was already
in use for the implementation of the COPY file format as a pgloader source.

Oops.

And CCL was happily doing its magic anyway, so I had been blind to the
problem.

To fix, rename the new package to pgloader.pgcopy, and to avoid having
to deal with other problems of the same kind in the future, rename every
source package to pgloader.source.<format>, so that we now have
pgloader.source.copy and pgloader.pgcopy, two visibly different packages
to deal with.

This light refactoring came with a challenge though. The split between
the pgloader.sources API and the rest of the code involved some circular
dependencies in the namespaces. CL is pretty flexible here because it
can reload code definitions at runtime, but it was still a mess. To
untangle it, implement a new namespace, the pgloader.load package, where
we can use the pgloader.sources API and the pgloader.connection and
pgloader.pgsql APIs too.

A little problem gave birth to quite a massive patch, as happens when
refactoring and cleaning up the dirt in any large enough project, right?

See #748.
2018-02-24 19:24:22 +01:00
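
A sketch of the resulting package layout; the package names are from the
patch, the documentation strings are mine:

  (defpackage #:pgloader.source.copy
    (:use #:cl)
    (:documentation "The COPY file format as a pgloader data source."))

  (defpackage #:pgloader.pgcopy
    (:use #:cl)
    (:documentation "The PostgreSQL COPY wire protocol implementation."))

  (defpackage #:pgloader.load
    (:use #:cl)
    (:documentation "May use the pgloader.sources, pgloader.connection
     and pgloader.pgsql APIs without introducing circular dependencies."))
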
Dimitri Fontaine
ba2d8669c3 Add support for the newer Qmynd error handling.
We now have a qmynd-impl::decoding-error condition to deal with, which
has very good error reporting, so we don't need to poke into babel
details anymore. The error message adds the column name, type and
collation to the output, too.

We keep the babel handlers for a while, until people have all migrated
to using the patch in qmynd.

With the fix to Qmynd, Fix #716.
2018-01-22 16:14:05 +01:00
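
A minimal sketch of the new handler; only the condition name comes from
the message, the surrounding function is illustrative:

  (defun decode-mysql-row (thunk)
    "Call THUNK, reporting qmynd decoding errors without babel details."
    (handler-case (funcall thunk)
      (qmynd-impl::decoding-error (e)
        ;; the condition's report already includes column name, type
        ;; and collation, so printing the condition object is enough
        (format *error-output* "~a~%" e)
        nil)))
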
Dimitri Fontaine
07cdf3e7e5 Use MySQL column names in MySQL queries.
The query for concurrency-support didn't get the memo that we should ignore
PostgreSQL identifier-case when querying the source MySQL database. Fix the
query string to include column names as given by the MySQL catalogs.

In bug report #703, the problem is found in PostgreSQL queries; that
part had already been fixed. Trying to reproduce the bug produced an
error in the concurrency-support query instead, so let's fix this one.

Fix #703.
2017-12-22 14:15:46 +01:00
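
The shape of the fix, sketched: quote identifiers the MySQL way
(backticks, original case) rather than applying the PostgreSQL
identifier-case transformation. The min/max query feeds the concurrency
support described further down this log; the function is illustrative:

  (defun concurrency-support-query (table-name column-name)
    (format nil "select min(`~a`), max(`~a`) from `~a`"
            column-name column-name table-name))
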
Dimitri Fontaine
1d7706c045 Fix the MySQL encoding error handling.
The error handling would try to read past the error buffer in some
cases, when the BABEL lib reports a position that's past the end of the
buffer just read.

Fix #661.
2017-11-13 11:27:47 +01:00
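
The gist of the fix, sketched: clamp the reported position to the buffer
bounds before extracting the bytes around the error (the function name
and context size are illustrative):

  (defun error-context (buffer position &optional (context 16))
    "Return the bytes around POSITION without reading past BUFFER."
    (let* ((len (length buffer))
           (pos (min position len)))
      (subseq buffer (max 0 (- pos context)) (min len (+ pos context)))))
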
Dimitri Fontaine
db7a91d6c4 Add the MySQL target schema to the search_path.
In the next release, pgloader defaults to targeting a new schema named
the same as the MySQL database, because that's what makes more sense.
But people are used to having 'public' in the search_path, with
everything in there.

So when creating our target schema, when migrating from MySQL, arrange it so
that the new schema is in the search_path by issuing a command like:

  ALTER DATABASE plop SET search_path TO public, f1db;

And make this command visible in verbose (NOTICE) mode too, so that
users can see what happens.

Fix #654. I think.
2017-11-02 12:40:21 +01:00
Dimitri Fontaine
dfac729daa Refrain from querying the catalogs again.
When we already have the information in the pgloader internal catalogs,
don't issue another MySQL query. In this case, the extra query was being
used to fetch the list of columns and their data types, so that we can
choose to send either `colname` or astext(`colname`) as `colname` for
some geographic types.

That's one less MySQL query per table.
2017-09-14 15:35:45 +02:00
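
A sketch of building the SELECT list from the catalogs already at hand;
the column representation and the set of geographic types are
assumptions:

  (defun select-list (columns)
    "COLUMNS is a list of (name . type-name) pairs from the catalogs."
    (mapcar (lambda (column)
              (destructuring-bind (name . type) column
                (if (member type '("geometry" "point" "linestring")
                            :test #'string-equal)
                    (format nil "astext(`~a`) as `~a`" name name)
                    (format nil "`~a`" name))))
            columns))
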
Dimitri Fontaine
d37ad27754 Handle empty tables in concurrency support for MySQL.
When the table is empty, we get nil for the min and max values of the
id column. In that case we don't compute a set of ranges, and “cancel”
concurrency support for the empty table.

Fixes #596.
2017-07-18 13:35:01 +02:00
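
The guard, sketched; the half-open range representation is illustrative
and pairs with the ranged SELECT shown under the next entry:

  (defun compute-ranges (min max range-size)
    "Return a list of half-open (lo, hi] ranges, or NIL when MIN or MAX
     is NIL because the table is empty. RANGE-SIZE must be positive."
    (when (and min max)
      (loop :for lo := (1- min) :then hi
            :for hi := (min (+ lo range-size) max)
            :collect (cons lo hi)
            :until (= hi max))))
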
Dimitri Fontaine
0549e74f6d Implement multiple reader per table for MySQL.
Experiment with the idea of splitting the read work across several
concurrent threads, where each reader reads a portion of the source
table, using a WHERE id <= x AND id > y clause in its SELECT query.

For this to kick in, a number of conditions need to be met, as described
in the documentation. The main interest might not be faster queries to
fetch the same overall data set, but better concurrency, with as many
readers as writers and each couple having its own dedicated queue.
2017-06-28 16:23:18 +02:00
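
Each reader's query would then look like this sketch (quoting and names
illustrative), matching the WHERE clause described above:

  (defun ranged-select (table-name id-column lo hi)
    (format nil "select * from `~a` where `~a` > ~d and `~a` <= ~d"
            table-name id-column lo id-column hi))
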
Dimitri Fontaine
4931604361 Allow ALTER SCHEMA command for MySQL.
This pgloader command allows migrating tables while changing the schema
they are found in, in between the MySQL source database and the
PostgreSQL target database.

This changes the default behavior of pgloader with MySQL from always
targeting the 'public' schema to targeting by default a schema named
the same as the MySQL database. You can revert to the old behavior by
adding a rule:

   ALTER SCHEMA 'dbname' RENAME TO 'public'

We might want to add a patch to re-install the default behavior later.

Also see #489, where it used not to be possible to rename the schema at
migration time, causing strange errors (you need to spot NIL as the
schema name in the "failed to find target table" messages).
2016-12-18 19:31:21 +01:00
Dimitri Fontaine
a7291e9b4b Simplify copy-database implementation further.
Following up on the recent refactoring effort, the IXF and DB3 source
classes didn't get the memo that they could piggyback on the generic
copy-database implementation. This patch implements that.

In passing, also simplify the instanciate-table-copy-object method for
copy subclasses that need specialization here, by using change-class and
call-next-method so as to reuse the generic code as much as possible.
2016-01-01 14:28:09 +01:00
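
A hedged sketch of the change-class/call-next-method pattern; the class
stubs are illustrative, only the method name is pgloader's:

  (defclass copy () ())
  (defclass copy-ixf (copy) ())

  (defmethod instanciate-table-copy-object ((copy copy) table)
    "Shared implementation: build a plain COPY object."
    (declare (ignore table))
    (make-instance 'copy))

  (defmethod instanciate-table-copy-object ((copy copy-ixf) table)
    "Reuse the shared code, then specialize the result in place."
    (let ((object (call-next-method)))
      (change-class object 'copy-ixf)))
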
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.

Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items, such
as mapping tables from one schema to another.

Also, supporting database sources having a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-database methods for database source
types, so that they all use a fetch-metadata generic function along with
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to the start of the data copy step, at
which point the copy class instances are all that's used.

This might be refactored again in a follow-up patch.
2015-12-30 21:53:01 +01:00
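
A rough sketch of the shape of such a catalog; the slot lists are
illustrative, the real structures carry more:

  (defstruct catalog name schema-list)
  (defstruct schema  name catalog table-list view-list)
  (defstruct table   name schema field-list column-list index-list)
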
Dimitri Fontaine
dca3dacf4b Don't issue useless MySQL catalog queries...
When the option "WITH no foreign keys" is in use, it's not necessary to
go read the foreign key information_schema bits at all, so just don't
issue the query; same thing with the "create no indexes" option.

In not-that-old versions of MySQL, the referential_constraints table of
information_schema doesn't exist, so this should also make pgloader
compatible with MySQL 5.0-something and earlier.
2015-12-03 19:24:00 +01:00
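
Sketched, the logic is just an option check before each catalog query
(the options plist and the query wording are illustrative):

  (defun maybe-list-foreign-keys (options run-query-fn)
    "Skip the information_schema query when foreign keys are disabled."
    (unless (getf options :no-foreign-keys)
      (funcall run-query-fn "
        select constraint_name, table_name, referenced_table_name
          from information_schema.referential_constraints
         where constraint_schema = ?")))   ; placeholder: schema name
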
Dimitri Fontaine
2dd7f68a30 Fix index completion management in MySQL and SQLite.
We used to wait for the wrong number of workers, meaning the rest of
the code began running before the indexes were all available. A user
report where one of the indexes takes a very long time to compute made
the problem obvious.

In passing, also improve reporting of those rendez-vous sections.
2015-11-29 17:29:57 +01:00
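
The counting fix, sketched with lparallel; the channel API is real, the
surrounding function is illustrative and assumes lparallel:*kernel* is
initialized:

  (defun run-and-wait-for-all (tasks)
    "Submit TASKS (a list of thunks), wait for exactly that many results."
    (let ((channel (lparallel:make-channel)))
      (dolist (task tasks)
        (lparallel:submit-task channel task))
      (loop :repeat (length tasks)
            :do (lparallel:receive-result channel))))
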
Dimitri Fontaine
bc870ac96c Use 2 copy threads per target table.
It's been shown by Andres Freund's benchmarks that the best number of
parallel COPY threads concurrently active against a single table is 2 as
of the current PostgreSQL development version (up to 9.5 stable, and it
might still apply to 9.6 depending on how the problem gets solved).

Henceforth, hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.

Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY
workers, a reader, and a transformer thread.

It's all still experimental as far as performance goes; upcoming patches
should bring the capability to configure the wanted parallelism from the
command line and the load command, though.

Also, other source types will want to benefit from the same
capabilities; it just happens that it's easier to play with MySQL first
for some reasons here.
2015-11-17 17:06:36 +01:00
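
The thread arithmetic described above, as a sketch:

  ;; 1 reader + 1 transformer + 2 COPY workers for each of the 2 tables
  ;; loaded concurrently: (1 + 1 + 2) * 2 = 8 threads in the kernel.
  (setf lparallel:*kernel*
        (lparallel:make-kernel (* (+ 1 1 2) 2)))
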
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the
data was then the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing, the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
new setup isn't detrimental to performance on smaller data sets.

Next improvements are going to allow more features: a specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.

And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard-coded as it is today.
It's on the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with whatever
it is you find by doing so!
2015-10-23 00:17:58 +02:00
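
A minimal sketch of the two-queue pipeline using the lparallel.queue
API; the :done sentinel and the batch-formatting function are
illustrative:

  (defvar *raw-queue*       (lparallel.queue:make-queue :fixed-capacity 128))
  (defvar *processed-queue* (lparallel.queue:make-queue :fixed-capacity 128))

  (defun transformer-loop (format-batch-fn)
    "Pop raw vectors of rows, push pre-formatted COPY batches."
    (loop :for raw := (lparallel.queue:pop-queue *raw-queue*)
          :until (eq raw :done)
          :do (lparallel.queue:push-queue
               (funcall format-batch-fn raw) *processed-queue*)
          :finally (lparallel.queue:push-queue :done *processed-queue*)))
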
Dimitri Fontaine
4f3b3472a2 Cleanup and timing display improvements.
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.

We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...

Let's solve problems one at a time though; I guess multi-threading and
accurate wall-clock times can't be expected to mix and match that
easily anyway (multiple cores, a single RTC, and all that).
2015-10-22 22:36:40 +02:00
Dimitri Fontaine
633067a0fd Allow more parallelism in database migrations.
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers setup rather than the workers themselves.

From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.

On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.

In passing, stop sharing the same connection object between parallel
workers that used to run one after another under control; see the new
API clone-connection (which takes over from new-pgsql-connection).
2015-10-20 22:15:55 +02:00
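
A hedged sketch of the clone-connection API named above (the
documentation string is mine):

  (defgeneric clone-connection (connection)
    (:documentation
     "Return a fresh connection instance sharing CONNECTION's parameters
      (host, port, user, ...), so that each parallel worker opens and
      owns its connection instead of sharing one."))
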
Dimitri Fontaine
41e9eebd54 Rationalize common generic API implementation.
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that
in some cases we don't need the extra level of flexibility: every change
to the API had to be systematically propagated to all the specific
methods, so just use a single generic definition where possible.

In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations
rather than having to copy/paste and maintain several versions of the
same code.

It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
2015-10-05 21:25:21 +02:00
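
The intermediate-class idea, sketched; beyond copy-database the names
are illustrative:

  (defclass copy () ())          ; the common API for every source type
  (defclass db-copy (copy) ())   ; intermediate class: database sources

  (defgeneric copy-database (copy &key)
    (:documentation "Run a full migration for COPY."))

  ;; one method on the intermediate class replaces the per-source
  ;; copies of the same code
  (defmethod copy-database ((copy db-copy) &key)
    ;; fetch metadata, prepare the target, copy the data, complete the
    ;; schema: the steps shared by MySQL, MS SQL, SQLite, ...
    copy)
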
Dimitri Fontaine
0d9c2119b1 Send one update-stats message per batch.
Updating the stats used to be a quite simple incf, and doing it once
per row read was good enough; but now that it involves sending a message
to the monitor thread, let's only send one message per batch, reducing
the communication load here.
2015-10-05 18:04:08 +02:00
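
Sketched (the callback style is illustrative): keep a local counter and
message the monitor once per batch rather than once per row:

  (defun read-rows (next-row-fn send-stats-fn batch-size)
    "Read rows, sending one stats message per BATCH-SIZE rows."
    (loop :with count := 0
          :for row := (funcall next-row-fn)
          :while row
          :do (incf count)
          :when (= count batch-size)
            :do (funcall send-stats-fn count)
                (setf count 0)
          :finally (when (plusp count)
                     (funcall send-stats-fn count))))
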
Dimitri Fontaine
96a33de084 Review the stats and reporting code organisation.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that the
output buffer is not subject to race conditions; extend its use to also
deal with statistics messages.

In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.

Also, as a nice side effect of the code simplification and refactoring,
this fixes #283, wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
2015-10-05 01:46:29 +02:00
Dimitri Fontaine
ba44e9786d More MySQL debugging information. 2015-08-20 12:34:12 +02:00
Dimitri Fontaine
3af99051d2 Fix the preserve index names option.
MySQL names its primary keys "PRIMARY", and we need to always uniquify
this name even when the user asked pgloader to preserve index names.

Also, the create-indexes-again function now needs to ask for index names
to be preserved specifically.
2015-07-18 23:39:32 +02:00
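
A sketch of the rule; the uniquified naming scheme is illustrative, not
pgloader's exact one:

  (defun choose-index-name (index-name table-name preserve-names-p)
    "Preserve INDEX-NAME on demand, except for MySQL's generic PRIMARY."
    (if (and preserve-names-p
             (not (string-equal index-name "PRIMARY")))
        index-name
        (format nil "idx_~a_~a" table-name index-name)))
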
Alex Baretta
817fc9a258 Fix string delimiter syntax in COMMENT statements 2015-05-26 16:10:39 -07:00
Dimitri Fontaine
7d2d09ce68 Add the option to preserve MySQL index names, fix #187.
See test/parse/hans.goeuro.load for an example usage of the new option.

In passing, any error when creating indexes is now properly reported and
logged, which was missing previously. Oops.
2015-03-07 20:19:47 +01:00
Dimitri Fontaine
48f451bdbc Implement the option to disable triggers when loading data.
This option is dangerous: it allows skipping ALL triggers when loading
data into PostgreSQL. This includes foreign key constraint definitions,
and will allow loading data out of order.

When using both the "create no table" and "disable triggers" options, it
will be possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.

Fix #167.
2015-02-19 15:05:10 +01:00
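
A sketch of the per-table statement such an option leads to; whether
pgloader issues exactly this SQL is an assumption, but DISABLE TRIGGER
ALL is standard PostgreSQL and also disables the FK constraint triggers:

  (defun disable-triggers-sql (table-name)
    (format nil "ALTER TABLE ~a DISABLE TRIGGER ALL" table-name))

  (defun enable-triggers-sql (table-name)
    (format nil "ALTER TABLE ~a ENABLE TRIGGER ALL" table-name))
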
Dimitri Fontaine
b403c40d30 Implement support for MySQL comments, fix #126.
Only table (BASE TABLE) and column comments are supported for now. I
didn't even try to see if more are possible and interesting to support.
2015-01-09 01:49:56 +01:00
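
A sketch of emitting those comments, doubling single quotes as the
string-delimiter fix higher up in this log requires (the helper name is
illustrative):

  (defun comment-on-table-sql (table-name comment)
    (format nil "comment on table ~a is '~a'"
            table-name
            (cl-ppcre:regex-replace-all "'" comment "''")))
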
Dimitri Fontaine
302a7d402b Refactor connection handling, and clean-up many things.
That's the big refactoring patch I've been sitting on for too long.

First, refactor connection handling to use a unified "connection"
concept (a class and generic-functions API) everywhere, so that the
COPY-derived objects just use that in their :source-db and :target-db
slots.

Given that, there's no more messing around with *pgconn* and *myconn-*
and other special variables anywhere in the tree.

Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when the new API got into place.

Third, fix any other oddity or missing part found while doing those
first two activities; it was long overdue anyway...
2014-12-26 21:50:29 +01:00
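
A hedged sketch of the uniform connection API; the concept and the
:source-db/:target-db slots come from the message, the definitions
themselves are illustrative:

  (defclass connection ()
    ((handle :initform nil :accessor connection-handle)))

  (defgeneric open-connection (connection &key &allow-other-keys)
    (:documentation "Open CONNECTION and return it."))

  (defgeneric close-connection (connection)
    (:documentation "Close CONNECTION and return it."))

  (defmacro with-connection ((var connection) &body body)
    "Run BODY with VAR bound to the opened CONNECTION."
    `(let ((,var (open-connection ,connection)))
       (unwind-protect (progn ,@body)
         (close-connection ,var))))
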