Thanks to a reproducible test case we can see that the MySQL default for a
varbinary column is an empty string, so tweak the transform function
byte-vector-to-bytea to cope with that.
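For illustration, a minimal sketch of such a transform, returning NIL (hence
NULL) for the empty-string default; this is not pgloader's actual
implementation:

    ;; sketch only: turn a MySQL byte vector into PostgreSQL bytea hex
    ;; format, accepting the empty-string default MySQL may hand us
    (defun byte-vector-to-bytea (value)
      (unless (and (stringp value) (string= value ""))
        (let ((bytes (if (stringp value)
                         (map 'vector #'char-code value)
                         value)))
          (with-output-to-string (s)
            (write-string "\\x" s)
            (loop :for byte :across bytes
                  :do (format s "~(~2,'0x~)" byte))))))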
Rather than trying hard to have PostgreSQL fully qualify the index name
with tricks around the search_path setting at the time ::regclass is
executed, simply join on pg_namespace to retrieve the schema into a new
slot of our pgsql-index structure so that we can reuse it when needed.
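For reference, a sketch of the kind of catalog query this implies, written
against Postmodern directly; the actual query and pgsql-index handling in
pgloader differ, and the table name below is just a placeholder:

    ;; sketch: fetch each index together with its schema by joining
    ;; pg_namespace instead of relying on ::regclass and search_path
    (pomo:query "select n.nspname, i.relname, pg_get_indexdef(x.indexrelid)
                   from pg_index x
                        join pg_class i on i.oid = x.indexrelid
                        join pg_class t on t.oid = x.indrelid
                        join pg_namespace n on n.oid = i.relnamespace
                  where t.relname = $1"
                "some-table-name")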
Also add a test case for the scenario, including both a UNIQUE
constraint and a classic index, because the DROP and CREATE/ALTER
instructions differ.
Beyond the syntax and API tweaks, this patch also makes it possible for a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) to be
loaded with several files of the group processed in parallel.
To that effect, tweak the md-connection and md-copy implementations again.
In the recent refactoring and parallelism improvements, index creation
would kick in before we knew that the data was done being copied over to
the target table.
Fix that by maintaining a writers-count hashtable and only starting to
create indexes when that count reaches zero, meaning that all the
concurrent tasks started to handle the COPY of the data are done.
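A rough sketch of that bookkeeping, using hypothetical names and a
bordeaux-threads lock; pgloader's own state handling is more involved:

    ;; count writers per table; when the count drops to zero, every COPY
    ;; task for that table is done and index creation may start
    (defvar *writers-count* (make-hash-table :test 'equal))
    (defvar *writers-count-lock* (bt:make-lock))

    (defun register-writer (table-name)
      (bt:with-lock-held (*writers-count-lock*)
        (incf (gethash table-name *writers-count* 0))))

    (defun writer-done (table-name create-indexes-fn)
      (bt:with-lock-held (*writers-count-lock*)
        (when (zerop (decf (gethash table-name *writers-count* 0)))
          (funcall create-indexes-fn table-name))))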
It was worker-count and it's now exposed as workers in the WITH clause,
but we can keep it as worker-count in the internal API, and it feels
better that way.
Add the workers and concurrency settings to the LOAD commands for
database sources so that users can now tweak them, and mention them in
the documentation too.
From the documentation string of the copy-from method as found in
src/sources/common/methods.lisp:
We allow WORKER-COUNT simultaneous workers to be active at the same time
in the context of this COPY object. A single unit of work consists of
several kinds of workers:
- a reader getting raw data from the COPY source with `map-rows',
- N transformers preparing raw data for PostgreSQL COPY protocol,
- N writers sending the data down to PostgreSQL.
The N here is set to the CONCURRENCY parameter: with a CONCURRENCY of
2, we start (+ 1 2 2) = 5 concurrent tasks, with a CONCURRENCY of 4 we
start (+ 1 4 4) = 9 concurrent tasks, of which only WORKER-COUNT may be
active simultaneously.
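In other words, the number of tasks per unit of work is a direct function
of CONCURRENCY, while WORKER-COUNT caps how many of them may be active at
once; as a trivial sketch:

    ;; 1 reader + CONCURRENCY transformers + CONCURRENCY writers
    (defun unit-of-work-task-count (concurrency)
      (+ 1 concurrency concurrency))

    ;; (unit-of-work-task-count 2) => 5
    ;; (unit-of-work-task-count 4) => 9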
Those options should find their way into the remaining sources; that's for
a follow-up patch though.
Have PostgreSQL always fully qualify the index-related objects and SQL
definition statements when fetching the list of indexes of a table, by
playing with an empty search_path.
Also improve the whole index creation by passing the table object as the
context from which to derive the table name, so that schema-qualified
tables are taken into account properly.
In a previous commit the typemod matching code had been broken, and we
failed to notice that until now. Thanks to bug report #322 we just got
the memo...
Add a test case in the local-only MySQL database.
The regression testing facilities should be improved to be able to test
a full database, and then to dynamically create said database from code
or something to ease test coverage of those cases.
When creating the primary keys on top of the unique indexes, we might
still have errors (e.g. with NULL values). Make it so that a failure in
one pkey doesn't fail every other one, by having them all run within a
single connection rather than a single transaction.
In order to avoid all concurrently prepared batches of rows being sent to
the PostgreSQL COPY command at exactly the same time, randomly vary the
size of each batch between -30% and +30% of the batch rows parameter.
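A minimal sketch of that randomization, assuming a batch-rows parameter is
passed in; the real code keeps its own state:

    ;; pick a batch size uniformly within ±30% of the batch-rows setting
    (defun randomized-batch-size (batch-rows)
      (let ((fuzz (truncate (* batch-rows 3) 10)))
        (+ batch-rows (- (random (1+ (* 2 fuzz))) fuzz))))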
pgloader's parallel workload is still hardcoded, but at least the code now
uses clear parameters as input, so that a later patch can expose them to
the end user.
The notions of workers and concurrency are now handled as follows:
- concurrency is how many tasks are allowed to happen at once; by
  default we have a reader thread, a transformer thread and a COPY
  thread all active for each table being loaded,
- worker-count is how many parallel threads are allowed to run
  simultaneously and defaults to 8 currently, which means that in a
  typical migration from a database source and given the default
  concurrency of 1 (3 threads), we might be loading up to 3 different
  tables at any time.
The idea is to expose those settings to the user in the load file and as
command line options (such as --jobs) and see what that gives us. It might
help, e.g., to use more cores when loading a single CSV file.
As of this patch, there can still be only one reader thread, and the
number of transformer threads must be the same as the number of COPY
threads.
Finally, the user-defined projections for CSV-like files are now handled
in the transformation threads rather than in the reader thread...
Thanks to the Common Lisp character data type, it's easy for pgloader to
enforce always speaking to PostgreSQL in utf-8, and that's actually what
has been done from the beginning.
Now, without good reason, the first example of a SET clause added to the
docs was about how to set client_encoding, which should NOT be done.
Fix that at the user level by removing the bad example from the docs and
adding a WARNING whenever client_encoding is set to a known bad value.
It's a WARNING because we then simply force 'utf-8' anyway.
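Something along these lines, with a hypothetical helper name and a check
simplified to "anything but utf-8 is bad"; the real test in pgloader may
differ:

    ;; sketch: warn about and override a client_encoding we can't honour
    (defun enforce-client-encoding (encoding)
      (if (string-equal encoding "utf-8")
          encoding
          (progn
            (warn "pgloader always speaks utf-8 to PostgreSQL, ~
                   ignoring client_encoding ~s" encoding)
            "utf-8")))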
Also, completely review the format-vector-row function to avoid doing
double work with the Postmodern facilities we piggyback on. The work was
previously split halfway through, and the utf-8 conversion was actually
done twice.
Various Linux distributions provide SBCL without core-compression
enabled. On the other hand, Mac OS X (at least via `homebrew`) provides
SBCL with core-compression enabled. To make installation easier, teach
the make process to detect core-compression and use it if possible.
We convert a default value calling newsequentialid() into a call to
uuid_generate_v1() from the PostgreSQL uuid-ossp extension, which seems
to be the equivalent function.
The extension "uuid-ossp" needs to be installed in the target database.
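That prerequisite boils down to a single statement on the target database
(superuser rights may be needed), for example through Postmodern:

    ;; uuid_generate_v1() default values depend on this extension
    (pomo:execute "create extension if not exists \"uuid-ossp\"")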
(Blind) Fix #246.
When building from sources within the git environment, the version
number is ok, but it was wrong when building in the docker image. Fix
the version number to 3.3.0.50 to show that we're talking about a
development snapshot leading to version 3.3.1.
Yeah, four-part version numbers. That happens, apparently.
Apparently it's quite common nowadays for people to use docker to build
and run software in a contained way, so provide users with the facility
they need in order to do that.
Following up on the recent refactoring effort, the IXF and DB3 source
classes didn't get the memo that they could piggyback on the generic
copy-database implementation. This patch implements that.
In passing, also simplify the instanciate-table-copy-object method for
copy subclasses that need specialization here, by using change-class and
call-next-method so as to reuse the generic code as much as possible.
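As a sketch of that pattern, with the IXF class name assumed for
illustration:

    ;; reuse the generic method to build the copy object, then only
    ;; change its class so that the IXF-specific methods apply
    (defmethod instanciate-table-copy-object ((copy copy-ixf) table)
      (declare (ignorable table))
      (let ((new-copy (call-next-method)))
        (change-class new-copy 'copy-ixf)))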
In the previous refactoring patch that option mistakenly went away,
although it is still needed for MS SQL and it is planned to make use of
it in the other source types too...
See #316 for reference.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items, such
as mapping tables from one schema to another.
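The shape of those instances looks roughly like the following; the real
slot lists in pgloader are richer than this sketch:

    ;; rough sketch of the structured catalog replacing the anonymous lists
    (defstruct catalog name schema-list)
    (defstruct schema  source-name name table-list view-list)
    (defstruct table   schema source-name name
                       field-list column-list index-list fkey-list)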
Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including bringing the
MS SQL support on par with MySQL (materialized views have been asked for
already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-database methods for database source
types, so that they all use a fetch-metadata generic function, a
prepare-pgsql-database generic function and a complete-pgsql-database
generic function. Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized list-all-columns and
friends implementations. Once the catalog has been fetched, an explicit
CAST call is then needed before we can continue.
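The overall flow then looks roughly like this, with class names and
argument lists assumed where the text doesn't spell them out (copy-data in
particular is a placeholder for the data copy step):

    ;; sketch of the shared orchestration described above
    (defmethod copy-database ((copy db-copy) &key &allow-other-keys)
      (let ((catalog (fetch-metadata copy (make-catalog))))
        (cast catalog)                           ; apply the CAST rules
        (prepare-pgsql-database copy catalog)    ; create schemas, tables
        (copy-data copy catalog)                 ; COPY the table data
        (complete-pgsql-database copy catalog))) ; indexes, fkeys, etc.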
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case,
and in passing fix some call sites of apply-identifier-case.
Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
When the option "WITH no foreign keys" is in use, it's not necessary to
read the foreign key information_schema bits at all, so just don't issue
the query; the same goes for the "create no indexes" option.
In not-that-old versions of MySQL, the referential_constraints table of
information_schema doesn't exist, so this should make pgloader
compatible with MySQL 5.0-something and earlier.
It turns out the summary write times included time spent waiting for
batches to be ready, which isn't fair to the PostgreSQL COPY
implementation, and moreover doesn't help in figuring out the
bottlenecks...
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.
Also review the organisation of responsibilities in the code, moving
queue.lisp to utils/batch.lisp and renaming it, as its scope has been
reduced to only caring about preparing batches.
This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that is left for now is this
cleanup, which seems to be on the faster side of things, which is always
good.
We used to wait for the wrong number of workers, meaning the rest of the
code began running before the indexes were all available. A user report
in which one of the indexes takes a very long time to build made that
obvious.
In passing, also improve reporting of those rendez-vous sections.
It became pretty obvious that the error counting was broken; it happens
that I forgot to pass the information down to the state-handling parts
of the code.
In passing, improve and fix the counting of CSV parse errors and the
reporting of fatal errors.
In some cases, like when client_min_messages is set to debug5,
PostgreSQL might send notification messages to the connecting client
even while opening a connection. Those are still considered WARNINGs by
the Postmodern driver...
Handle those warnings by just printing them out in the pgloader logs,
rather than considering those conditions as hard failures (signaling a
db-connection-error).
The test/errors.load file sets the search_path to include the 'err'
schema, which is to be created by the test itself. PostgreSQL 9.1 raises
an error whereas 9.4 and following just accept the setting, and Travis
runs a 9.1 PostgreSQL.
Let's just create the schema before-hand so that we can still run tests
against SET search_path from the load file.
Filter the list of tables we migrate directly in the SQLite query,
avoiding returning useless data. To do that, use the LIKE pattern
matching supported by SQLite, as the REGEXP operator is apparently only
available when extra features are loaded.
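For illustration, a filter of that kind issued through cl-sqlite, where db
is an open handle from sqlite:connect; the exact query in pgloader may
differ:

    ;; list user tables only, skipping SQLite internal objects and views
    (sqlite:execute-to-list db "select name
                                  from sqlite_master
                                 where type = 'table'
                                   and name not like 'sqlite_%'")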
See #310 where filtering out the view still caused errors in the
loading.
Benchmarks by Andres Freund have shown that the best number of parallel
COPY threads concurrently active against a single table is 2 as of the
current PostgreSQL development version (up to 9.5 stable; it might still
apply to 9.6 depending on how the problem gets solved).
Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.
Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY
workers, a reader and a transformer thread.
It's all still experimental as far as performance goes; the next patches
should bring the capability to configure the desired parallelism from
the command line and the load command, though.
Also, other source types will want to benefit from the same
capabilities; it just happens to be easier to play with MySQL first for
a few reasons.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.
Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:
select * from loaded.data
except
select * from expected.data
If such a query returns any row, we know we didn't load what was
expected, and the regression test fails.
This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).