Commit Graph

20 Commits

Author SHA1 Message Date
adrian
3da4422fb5 few typos in comments/strings 2023-05-01 10:18:30 +02:00
Dimitri Fontaine
11970bbca8
Implement tables row count ordering for MySQL. (#1120)
This should help reduce the total migration time for databases with a few very
big tables and lots of smaller ones. It might be a little too naive as far as
the optimisation goes, while still being an improvement on the default
alphabetical ordering.

Fixes #1099.
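
A minimal sketch of the scheduling idea, assuming a biggest-first ordering
(illustrative code, not the actual pgloader implementation):

  ;; Sketch: sort tables so the biggest ones are copied first, letting the
  ;; many small tables fill in the remaining worker slots afterwards.
  (defun order-by-row-count (tables)
    "TABLES is a list of (name . estimated-row-count) pairs."
    (sort (copy-list tables) #'> :key #'cdr))

  ;; (order-by-row-count '(("tiny" . 10) ("huge" . 50000000) ("mid" . 120000)))
  ;; => (("huge" . 50000000) ("mid" . 120000) ("tiny" . 10))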
2020-04-04 16:40:53 +02:00
Dimitri Fontaine
b8da7dd2e9
Generic Function API for Materialized Views support. (#970)
Implement a generic-function API to discover the source database schema and
populate pgloader's internal version of the catalogs. Cut three copies of
roughly the same code path down to a single shared one, thanks to applying
some amount of OOP to the code.
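
Roughly the pattern, sketched with illustrative class and function names (the
actual pgloader generic functions differ):

  ;; One generic entry point, specialized per source database; the shared
  ;; driver code only ever calls the generic function.
  (defclass mysql-connection () ())
  (defclass mssql-connection () ())

  (defgeneric discover-schema (connection)
    (:documentation "Populate an internal catalog from CONNECTION."))

  (defmethod discover-schema ((conn mysql-connection))
    ;; would query information_schema here
    (list :catalog "mysql" :tables '()))

  (defmethod discover-schema ((conn mssql-connection))
    ;; would query sys.tables / sys.columns here
    (list :catalog "mssql" :tables '()))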
2019-05-20 19:28:38 +02:00
Dimitri Fontaine
dae5dec03c Allow fields/columns projections when parsing header.
When using a CSV header, we might find fields in a different order than the
target table columns, and maybe not all of the fields are going to be read.
Take into account the header we actually read rather than expecting it to
look like the target table definition.

Fix #888.
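
A minimal sketch of the projection, with illustrative names (not the actual
pgloader code):

  ;; Sketch: given the header actually read from the file and the target
  ;; column list, compute which field position feeds each column.
  (defun project-fields (header target-columns)
    "Return, for each column in TARGET-COLUMNS, its position in HEADER."
    (mapcar (lambda (col)
              (or (position col header :test #'string-equal)
                  (error "column ~a not found in the CSV header" col)))
            target-columns))

  ;; (project-fields '("id" "name" "email" "extra") '("email" "id"))
  ;; => (2 0)   ; read field 2 into email, field 0 into id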
2019-01-15 22:39:08 +01:00
Dimitri Fontaine
5c10f12a07 Fix duplicate package names.
In a previous commit we re-used the package name pgloader.copy for the now
separated implementation of the COPY protocol, but this package was already
in use for the implementation of the COPY file format as a pgloader source.

Oops.

And CCL was happily doing its magic anyway, so I had been blind to the
problem.

To fix it, rename the new package to pgloader.pgcopy, and to avoid having to
deal with other problems of the same kind in the future, rename every source
package to pgloader.source.<format>, so that we now have pgloader.source.copy
and pgloader.pgcopy, two visibly different packages to deal with.

This light refactoring came with a challenge though. The split between the
pgloader.sources API and the rest of the code involved some circular
dependencies in the namespaces. CL is pretty flexible here because it can
reload code definitions at runtime, but it was still a mess. To untangle it,
implement a new namespace, the pgloader.load package, where we can use the
pgloader.sources API along with the pgloader.connection and pgloader.pgsql
APIs.

A little problem gave birth to quite a massive patch, as happens when
refactoring and cleaning up the dirt in any large enough project, right?

See #748.
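
Roughly the resulting naming scheme, as a sketch of the package declarations
(the real ones export many more symbols):

  ;; COPY as a *source* format now lives under pgloader.source.*
  (defpackage #:pgloader.source.copy
    (:use #:cl))

  ;; ...while the PostgreSQL COPY protocol implementation gets a visibly
  ;; different name.
  (defpackage #:pgloader.pgcopy
    (:use #:cl))

  ;; New namespace sitting above both, so it can use the pgloader.sources
  ;; API together with the connection and pgsql APIs without circularity.
  (defpackage #:pgloader.load
    (:use #:cl))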
2018-02-24 19:24:22 +01:00
Dimitri Fontaine
8ee799070a Simplify format-vector-row a lot.
Copy some code over from cl-postgres-trivial-utf-8 and add support for
PostgreSQL COPY escaping right at the same place, allowing us to allocate the
formatted UTF-8 buffer only once, with the escaping already applied.

This patch was expected to be mostly about performance, but it seems to be
only about code cleanup, as it doesn't make a big difference in the testing I
could do here.

That said, getting rid of one intermediate buffer should be nice in terms of
memory management.
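
The COPY text escaping involved, sketched in isolation (simplified: the real
code performs the UTF-8 encoding in the same single pass):

  ;; Sketch: escape one string for the PostgreSQL COPY text protocol.
  (defun copy-escape (string)
    (with-output-to-string (out)
      (loop :for char :across string
            :do (case char
                  (#\\       (write-string "\\\\" out))
                  (#\Tab     (write-string "\\t" out))
                  (#\Newline (write-string "\\n" out))
                  (#\Return  (write-string "\\r" out))
                  (t         (write-char char out))))))

  ;; (copy-escape (format nil "a~cb" #\Tab)) => "a\\tb"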
2018-01-24 00:10:40 +01:00
Dimitri Fontaine
adf03c47ad Clean up source code organisation.
The COPY format and batch facilities are no longer the meat of our PostgreSQL
support in the src/pgsql directory, so have them live in their own space.
2018-01-23 19:52:13 +01:00
Dimitri Fontaine
0549e74f6d Implement multiple readers per table for MySQL.
Experiment with the idea of splitting the read work across several concurrent
threads, where each reader reads a portion of the table, using a
WHERE id <= x AND id > y clause in its SELECT query.

For this to kick in, a number of conditions need to be met, as described in
the documentation. The main benefit might not be fetching the same overall
data set faster, but better concurrency, with as many readers as writers and
each pair having its own dedicated queue.
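
A minimal sketch of the range splitting, assuming an integer id column
(illustrative code, not the actual pgloader implementation):

  ;; Sketch: split the id range into N slices, one per reader; each reader
  ;; then runs its own SELECT with WHERE id > lo AND id <= hi. LO is
  ;; exclusive and HI inclusive, so pass one below the smallest id as MIN-ID.
  (defun reader-ranges (min-id max-id reader-count)
    (let ((step (ceiling (- max-id min-id) reader-count)))
      (loop :for lo := min-id :then hi
            :for hi := (min max-id (+ lo step))
            :collect (cons lo hi)
            :until (= hi max-id))))

  (defun range-query (table-name range)
    (format nil "SELECT * FROM ~a WHERE id > ~d AND id <= ~d"
            table-name (car range) (cdr range)))

  ;; ids 1..100 split across 4 readers:
  ;; (reader-ranges 0 100 4) => ((0 . 25) (25 . 50) (50 . 75) (75 . 100))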
2017-06-28 16:23:18 +02:00
Dimitri Fontaine
6d66280fa5 Review parallelism and memory behavior.
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.

Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.

The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before and easier for SBCL's GC to cope with.

Note that batch concurrency is no more, replaced by prefetch rows: the reader
thread no longer builds batches, and the count of items in the reader queue
is now a number of rows, not of batches.

Anyway, with this patch in I can't reproduce the following issues:

Fixes #337, Fixes #420.
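
A sketch of the prefetch-rows idea, assuming the lparallel library (the
capacity value and helper names are illustrative):

  ;; Sketch: the reader queue is bounded by a number of rows (the prefetch
  ;; setting), not by a number of batches, and writers pop rows off it to
  ;; cook COPY batches.
  (defparameter *prefetch-rows* 25000)   ; illustrative value

  (defparameter *reader-queue*
    ;; the reader blocks as soon as *prefetch-rows* rows are in flight
    (lparallel.queue:make-queue :fixed-capacity *prefetch-rows*))

  (defun drain-batch (queue batch-rows)
    "Pop up to BATCH-ROWS currently available rows from QUEUE."
    (loop :repeat batch-rows
          :for row := (lparallel.queue:try-pop-queue queue)
          :while row
          :collect row))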
2017-06-27 23:10:33 +02:00
Dimitri Fontaine
1023577f50 Review internal database migration logic.
Many options are now available to pgloader users, including shortcuts that
were not defined clearly enough. That could result in stupid things being
done at times.

In particular, when picking the "data only" option, indexes are not to be
dropped before loading the data, but pgloader would still try to create them
again at the end of the load, because the option that controls that behavior
defaults to true and is not impacted by the "data only" choice.

In this patch we review the logic and ensure it's applied in the same
fashion in the different phases of the database migration: preparation,
copying, rebuilding of indexes and completion of the database model.

See also 96b2af6b2a where we began fixing
oddities but didn't go far enough.
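
A minimal sketch of the intended decision, with illustrative option names
(not pgloader's actual option plumbing):

  ;; Sketch: the "data only" choice must override the default for index
  ;; handling in every phase, both when dropping and when re-creating.
  (defun create-indexes-p (options)
    (and (getf options :create-indexes t)   ; defaults to true
         (not (getf options :data-only))))  ; but "data only" wins

  ;; (create-indexes-p '(:data-only t)) => NIL   ; leave indexes alone
  ;; (create-indexes-p '())             => T     ; normal migration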
2017-02-26 14:48:36 +01:00
Dimitri Fontaine
70572a2ea7 Implement support for existing target databases.
Also known as the ORM case: it happens that other tools are used to create
the target schema. In that case pgloader's job is to fill in the existing
target tables with the data from the source tables.

We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.

This behavior is activated when using the “create no tables” option as
in the following test-case setup:

  with create no tables, include drop, truncate

Fixes #400, for which I got a test-case to play with!
2016-08-06 20:19:15 +02:00
Dimitri Fontaine
42e9e521e0 Add option "max parallel create index".
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.

As this number might be large enough to exhaust the target PostgreSQL
server (e.g. maintenance_work_mem), we add an option to limit it to
something reasonable when the source schema isn't.

Fix #386 in which 150 indexes are found on a single source table.
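
A sketch of the throttling idea, assuming the lparallel library (the function
name and the run-sql callback are illustrative, not pgloader's actual API):

  ;; Sketch: run the CREATE INDEX statements through a kernel of
  ;; MAX-PARALLEL workers instead of starting one task per index.
  (defun create-indexes (index-ddl-list run-sql &key (max-parallel 4))
    (let ((lparallel:*kernel* (lparallel:make-kernel max-parallel)))
      (unwind-protect
           (let ((channel (lparallel:make-channel)))
             (dolist (ddl index-ddl-list)
               (lparallel:submit-task channel run-sql ddl))
             (loop :repeat (length index-ddl-list)
                   :do (lparallel:receive-result channel)))
        (lparallel:end-kernel))))

  ;; (create-indexes (list "CREATE INDEX idx_a ON t(a)"
  ;;                       "CREATE INDEX idx_b ON t(b)")
  ;;                 (lambda (sql) (format t "~a~%" sql))
  ;;                 :max-parallel 2)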
2016-04-11 17:40:52 +02:00
Dimitri Fontaine
7dd69a11e1 Implement concurrency and workers for file sources.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group processed in parallel.

To that end, tweak the md-connection and md-copy implementations again.
2016-01-16 22:53:55 +01:00
Dimitri Fontaine
dcc8eb6d61 Review API around worker-count.
It was worker-count and it's now exposed as workers in the WITH clause, but
we can actually keep it as worker-count in the internal API, and it feels
better that way.
2016-01-16 19:49:52 +01:00
Dimitri Fontaine
eb45bf0338 Expose concurrency settings to the end users.
Add the workers and concurrency settings to the LOAD commands for
database sources so that users can tweak them now, and add mentions of
them in the documentation too.

From the documentation string of the copy-from method as found in
src/sources/common/methods.lisp:

   We allow WORKER-COUNT simultaneous workers to be active at the same time
   in the context of this COPY object. A single unit of work consists of
   several kinds of workers:

     - a reader getting raw data from the COPY source with `map-rows',
     - N transformers preparing raw data for PostgreSQL COPY protocol,
     - N writers sending the data down to PostgreSQL.

   The N here is set to the CONCURRENCY parameter: with a CONCURRENCY of
   2, we start (+ 1 2 2) = 5 concurrent tasks, with a CONCURRENCY of 4 we
   start (+ 1 4 4) = 9 concurrent tasks, of which only WORKER-COUNT may be
   active simultaneously.

Those options should find their way into the remaining sources; that's for
a follow-up patch though.
2016-01-15 23:22:32 +01:00
Dimitri Fontaine
f256e12a4f Review load parallelism settings.
pgloader's parallel workload is still hardcoded, but at least the code now
uses clear parameters as input, so that it will be possible in a later
patch to expose them to the end user.

The notions of workers and concurrency are now handled as follows:

  - concurrency is how many tasks are allowed to happen at once; by
    default we have a reader thread, a transformer thread and a COPY
    thread, all active for each table being loaded,

  - worker-count is how many parallel threads are allowed to run
    simultaneously and currently defaults to 8, which means that in a
    typical migration from a database source, given the default
    concurrency of 1 (3 threads), we might be loading up to 3 different
    tables at any time.

The idea is to expose those settings to the user in the load file and as
command line options (such as --jobs) and see what it gives us. It might
help e.g. use more cores in loading a single CSV file.

As of this patch, there can still be only one reader thread, and the
number of transformer threads must be the same as the number of COPY
threads.

Finally, the user-defined projections for CSV-like files are now handled in
the transformation threads rather than in the reader thread...
2016-01-11 01:43:38 +01:00
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.

Having catalog and schema instances not only allows for code cleanup, but
will also allow implementing some bug fixes and wishlist items, such as
mapping tables from one schema to another.

Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function along with
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up until the data copy step starts, at which
point the copy class instances are all that's used.

This might be refactored again in a follow-up patch.
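
A rough sketch of the shape of such a catalog (struct and slot names are
illustrative; fetch-metadata is the generic entry point named above):

  (defstruct catalog name schema-list)
  (defstruct schema  name table-list)
  (defstruct table   name field-list column-list index-list)
  (defstruct column  name type-name default nullable)

  ;; one generic entry point that every source type specializes:
  (defgeneric fetch-metadata (copy catalog &key &allow-other-keys)
    (:documentation "Introspect the source behind COPY and fill in CATALOG."))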
2015-12-30 21:53:01 +01:00
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently to load the
data from the source to PostgreSQL. The transformation of the data was then
the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.

Next improvements are going to allow more features: a specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader, in later patches.

And some day, too, we will need to make the number of workers a user-defined
variable rather than something hard-coded as it is today. It's on the
todo list; meanwhile, dear user, consider changing the (make-kernel 6)
into (make-kernel 12) or something else in src/sources/mysql/mysql.lisp,
and consider enlightening me with whatever it is you find by doing so!
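
A rough sketch of the three-thread topology, assuming the lparallel library
(the reader/transformer/writer functions and the formatting and sending
helpers are stand-ins, and the real code pushes pre-formatted batches rather
than individual rows):

  (defparameter *raw-queue* (lparallel.queue:make-queue))
  (defparameter *processed-queue* (lparallel.queue:make-queue))

  (defun format-copy-row (row)      ; stand-in for the COPY text encoding
    (format nil "~{~a~^,~}" row))

  (defun send-to-postgresql (text)  ; stand-in for the COPY write
    (format t "~&~a~%" text))

  (defun reader (rows)
    (dolist (row rows) (lparallel.queue:push-queue row *raw-queue*))
    (lparallel.queue:push-queue :done *raw-queue*))

  (defun transformer ()
    (loop :for row := (lparallel.queue:pop-queue *raw-queue*)
          :until (eq row :done)
          :do (lparallel.queue:push-queue (format-copy-row row)
                                          *processed-queue*)
          :finally (lparallel.queue:push-queue :done *processed-queue*)))

  (defun writer ()
    (loop :for text := (lparallel.queue:pop-queue *processed-queue*)
          :until (eq text :done)
          :do (send-to-postgresql text)))

Each of the three functions would run in its own thread; the (make-kernel 6)
call mentioned above is what sizes that thread pool today.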
2015-10-23 00:17:58 +02:00
Dimitri Fontaine
41e9eebd54 Rationalize common generic API implementation.
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that in
some cases we don't need the extra level of flexibility: each change of the
API has had to be propagated to all the specific methods anyway, so just use
a single generic definition where possible.

In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations
rather than having to copy/paste and maintain several versions of the same
code.

It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
2015-10-05 21:25:21 +02:00
Dimitri Fontaine
cd46b6cbed Clean up common code for sources.
Only move code around, creating a src/sources/common directory with
several files in it so as to split up the overly large src/sources.lisp.
2015-01-08 23:17:40 +01:00