126 Commits

Author SHA1 Message Date
Dimitri Fontaine
d72c711b45 Implement support for on update CURRENT_TIMESTAMP.
That's the MySQL slang for a simple ON UPDATE trigger, and that's what
pgloader now translates the expression to. Fix #195.
2016-03-27 01:01:40 +01:00
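For reference, a sketch of the translation with illustrative names (the
table, trigger and function names aren't necessarily what pgloader emits):

    -- MySQL: updated_at timestamp ON UPDATE CURRENT_TIMESTAMP
    -- PostgreSQL equivalent, as a BEFORE UPDATE trigger:
    CREATE FUNCTION update_updated_at() RETURNS trigger AS $$
    BEGIN
      NEW.updated_at = CURRENT_TIMESTAMP;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER on_update_current_timestamp
      BEFORE UPDATE ON my_table
      FOR EACH ROW EXECUTE PROCEDURE update_updated_at();
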
Dimitri Fontaine
fcc6e8f813 Implement ALTER SCHEMA ... RENAME TO...
That's only available for MS SQL as of now, as it's the only source
database we have where the notion of a schema makes sense. Fix #224.
2016-03-26 20:25:03 +01:00
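A sketch of the new clause in a load command (connection strings are
placeholders):

    LOAD DATABASE
         FROM mssql://user@host/source_db
         INTO postgresql:///target_db

    ALTER SCHEMA 'dbo' RENAME TO 'public';
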
Dimitri Fontaine
6f078daeb9 Ensure logging of errors.
The first error of a batch was lost somewhere in the recent changes. My
current best guess is that the rewrite of the copy-batch function made
the handler-bind form set up by the handling-pgsql-notices macro
ineffective, but I can't see why that is.

See #85.
2016-03-26 17:51:38 +01:00
Dimitri Fontaine
e2fcd86868 Handle failure to convert index filters gracefully.
We should not block any processing just because we can't parse an index.
The best we can do for tonight is to try creating the index without the
filter; ideally we would skip building the index entirely. That's for a
later effort though, it's running late here.

See #365.
2016-03-22 00:29:25 +01:00
Dimitri Fontaine
5e18cfd7d4 Implement support for partial indexes.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.

The WHERE clause in both cases is limited to simple expressions over a
base table's row at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the
clause in PostgreSQL slang.

In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. a MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:

  MS SQL:     "([deleted]=(0))"  (that's from the catalogs)
  PostgreSQL: deleted = 'f'

Of course the parser is still barely tested; let's see what happens in
the wild now.

(Should) Fix #365.
2016-03-21 23:39:45 +01:00
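A sketch of the whole translation, with illustrative table and index
names:

    -- MS SQL filtered index, filter as found in the catalogs:
    CREATE INDEX idx_docs_live ON dbo.docs (doc_id)
     WHERE ([deleted]=(0));

    -- PostgreSQL partial index after rewriting the filter:
    CREATE INDEX idx_docs_live ON docs (doc_id)
     WHERE deleted = 'f';
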
Dimitri Fontaine
1ed07057fd Implement --on-error-stop command line option.
The implementation uses the dynamic binding *on-error-stop* so it's also
available when pgloader is used as a Common Lisp library.
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!

Fix #85.
2016-03-21 20:52:50 +01:00
Dimitri Fontaine
8476c1a359 Allow setting search_path with multiple schemas.
The PostgreSQL search_path accepts multiple schemas, and referencing
types and other tables might even require more than one. Allow setting
more than one schema by using the fact that PostgreSQL schema names
don't need to be individually quoted, and by passing the exact content
of the SET search_path value down to PostgreSQL.

Fix #359.
2016-03-20 20:54:08 +01:00
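For example, a load command SET clause can now carry several schemas in
one value, handed over verbatim (schema names are illustrative):

    SET search_path TO 'public, archive'
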
Dimitri Fontaine
3e8b7df0d3 Improve column formatting.
Have a pretty-print option where we try to be nice to the reader, and
don't use it in the CAST debug messages. Also allow working with the
real maximum length of column names rather than hardcoding 22 cols...
2016-03-16 21:46:41 +01:00
Dimitri Fontaine
f1fe9ab702 Assorted fixes to MS SQL support.
Having been given a test instance of a MS SQL database allowed quickly
fixing a series of assorted bugs related to schema handling of MS SQL
databases. As it's the only source with a proper notion of schema that
pgloader currently supports, it's no surprise we had them.

Fix #343. Fix #349. Fix #354.
2016-03-16 21:43:04 +01:00
Dimitri Fontaine
c724018840 Implement ALTER TABLE clause for MySQL migrations.
The new ALTER TABLE facility allows acting on tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes #224.

In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.

Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
2016-03-06 21:51:33 +01:00
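A sketch of the kind of clauses this enables in a MySQL load command
(the matching rules and names are illustrative; see the documentation
for the exact grammar):

    ALTER TABLE NAMES MATCHING 'sales' RENAME TO 'sales_legacy'
    ALTER TABLE NAMES MATCHING ~/^audit_/ SET SCHEMA 'audit'
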
Dimitri Fontaine
40c1581794 Review transaction and error handling in COPY.
The PostgreSQL COPY protocol requires an explicit initialization phase
that may fail, and in this case the Postmodern driver transaction is
already dead, so there's no way we can even send ABORT to it.

Review the error handling of our copy-batch function to cope with that
fact, and add some logging of non-retryable errors we may have.

Also improve the thread error reporting when using a binary image,
where it might be difficult to open an interactive debugger, while still
having the full blown Common Lisp debugging experience for the project
developers.

Add a test case for a missing column as in issue #339.

Fix #339, see #337.
2016-02-21 15:56:06 +01:00
Dimitri Fontaine
76668c2626 Review package dependencies.
The decision to use lots of different packages in pgloader has quite
strong downsides at times, and the manual management of dependencies is
one of them, in particular how to avoid circular ones.
2016-01-31 18:42:01 +01:00
Dimitri Fontaine
d9d9e06c0f Another attempt at fixing #323.
Rather than trying hard to have PostgreSQL fully qualify the index name
with tricks around the search_path setting at the time ::regclass is
executed, simply join on pg_namespace to retrieve that schema in a new
slot in our pgsql-index structure so that we can then reuse it when
needed.

Also add a test case for the scenario, including both a UNIQUE
constraint and a classic index, because the DROP and CREATE/ALTER
instructions differ.
2016-01-17 01:54:36 +01:00
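A minimal sketch of the kind of catalog join involved (not pgloader's
exact query; the table name is illustrative):

    SELECT n.nspname AS schema, i.relname AS index_name
      FROM pg_index x
      JOIN pg_class i ON i.oid = x.indexrelid
      JOIN pg_namespace n ON n.oid = i.relnamespace
     WHERE x.indrelid = 'public.some_table'::regclass;
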
Dimitri Fontaine
7dd69a11e1 Implement concurrency and workers for files sources.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.

To that effect, tweak again the md-connection and md-copy
implementations.
2016-01-16 22:53:55 +01:00
Dimitri Fontaine
bfdbb2145b Fix with drop index option, fix #323.
Have PostgreSQL always fully qualify the index related objects and SQL
definition statements when fetching the list of indexes of a table, by
playing with an empty search_path.

Also improve the whole index creation by passing the table object as the
context from which to derive the table-name, so that schema qualified
tables are taken into account properly.
2016-01-15 15:04:07 +01:00
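The trick relies on PostgreSQL schema-qualifying any name that isn't
reachable from the current search_path; a sketch, with an illustrative
table name:

    SELECT pg_catalog.set_config('search_path', '', false);

    -- every name now comes back fully qualified:
    SELECT indexrelid::regclass::text, pg_get_indexdef(indexrelid)
      FROM pg_index
     WHERE indrelid = 'public.some_table'::regclass;
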
Dimitri Fontaine
1ff204c172 Typo fix. 2016-01-15 14:45:19 +01:00
Dimitri Fontaine
133028f58d Desultory review of code indentation. 2016-01-12 14:52:44 +01:00
Dimitri Fontaine
94ef8674ec Typo fix (of sorts)
Some APIs didn't get the table-name-to-table memo...
2016-01-11 01:42:18 +01:00
Dimitri Fontaine
a3fd22acd3 Review pgloader encoding story.
Thanks to the Common Lisp character data type, it's easy for pgloader to
enforce always speaking to PostgreSQL in utf-8, and that's what has been
done from the beginning actually.

Now, without good reason, the first example of a SET clause added to
the docs was about how to set client_encoding, which should NOT be
done.

Fix that at the user level by removing the bad example from the docs
and adding a WARNING whenever the client_encoding is set to a known bad
value. It's a WARNING because we then simply force 'utf-8' anyway.

Also, completely review the format-vector-row function to avoid doing
double work with the Postmodern facilities we piggyback on. This was
done halfway through, and the utf-8 conversion was actually done twice.
2016-01-11 01:27:36 +01:00
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.

Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items such
as mapping tables from one schema to another.

Also, supporting database sources having a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata, a prepare-pgsql-database
and a complete-pgsql-database generic function. Actually, a single
method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn will call the specialized list-all-columns and
friends implementations. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to the start of the data copy step,
after which the copy class instances are all that's used.

This might be refactored again in a follow-up patch.
2015-12-30 21:53:01 +01:00
Dimitri Fontaine
b4bfa18877 Fix more table name quoting, fix #163 again.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case
and in passing fix some call sites to apply-identifier-case.

Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
2015-12-08 11:52:43 +01:00
Dimitri Fontaine
7c64a713d0 Fix PostgreSQL write times in the summary.
It turns out the summary write times included time spent waiting for
batches to be ready, which isn't fair to the PostgreSQL COPY
implementation, and moreover doesn't help figure out the bottlenecks...
2015-11-29 23:23:30 +01:00
Dimitri Fontaine
cca44c800f Simplify batch and transformation handling.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.

Also review the organisation of responsibilities in the code, allowing
queue.lisp to move into utils/batch.lisp, renamed as its scope has been
reduced to only caring about preparing batches.

This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that's left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
2015-11-29 17:35:25 +01:00
Dimitri Fontaine
bc870ac96c Use 2 copy threads per target table.
Andres Freund's benchmarks have shown that the best number of parallel
COPY threads concurrently active against a single table is 2, as of the
current PostgreSQL development version (up to 9.5 stable; it might
still apply to 9.6 depending on how we solve the problem).

Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.

Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY
workers, a reader and a transformer thread.

It's all still experimental as far as performance goes; next patches
should bring the capability to configure the wanted parallelism from
the command line and load command tho.

Also, other source types will want to benefit from the same
capabilities; it just happens that it's easier to play with MySQL first
for some reasons here.
2015-11-17 17:06:36 +01:00
Dimitri Fontaine
150d288d7a Improve our regression testing facility.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.

Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:

          select * from loaded.data
  except
          select * from expected.data

If such a query returns any row, we know we didn't load what was
expected and the regression test is failing.

This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).
2015-11-17 17:03:08 +01:00
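Note that a single EXCEPT only catches rows in one direction; a
complete diff (a sketch, not necessarily what the test harness runs)
also checks the other way around:

    (select * from loaded.data
     except
     select * from expected.data)
    union all
    (select * from expected.data
     except
     select * from loaded.data);
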
Dimitri Fontaine
6cbec206af Turns out SSL key/crt file paths should be strings.
Our PostgreSQL driver uses CFFI to load the SSL support from OpenSSL,
and as a result the certificate and key file names should be strings
rather than pathnames. Should fix #308 again...
2015-11-11 23:10:29 +01:00
Dimitri Fontaine
f8ae9f22b9 Implement support for SSL client certificates.
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:

  http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE

This uses Postmodern's special support for it. Unfortunately I couldn't
test it locally other than checking that it doesn't break non-SSL
connections. Pushing to get user feedback.
2015-11-09 11:32:17 +01:00
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently to load
the data from the source to PostgreSQL. The transformation of the data
was then the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.

Next improvements are going to allow more features: specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.

And some day, too, we will need to make the number of workers a
user-defined variable rather than something hardcoded as today. It's on
the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with
whatever it is you find by doing so!
2015-10-23 00:17:58 +02:00
Dimitri Fontaine
4f3b3472a2 Cleanup and timing display improvements.
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.

We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...

Let's solve problems one at a time tho; I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multi cores, single RTC and all that).
2015-10-22 22:36:40 +02:00
Dimitri Fontaine
1fb69b2039 Retry connecting to PostgreSQL in some cases.
Now that we can set up many concurrent threads working against the
PostgreSQL database, and before we open the number of workers to our
users, install a heuristic to manage the PostgreSQL error classes “too
many connections” and “configuration limit exceeded” so that pgloader
waits for some time (*retry-connect-delay*) then tries connecting again.

It's quite simplistic but should cover lots of border-line cases way
more nicely than just throwing the interactive debugger at the end user.
2015-10-20 23:15:05 +02:00
Dimitri Fontaine
633067a0fd Allow more parallelism in database migrations.
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the worker setup rather than the workers themselves.

From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.

On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.

In passing, stop sharing the same connection object between parallel
workers that used to be active only in sequence; see the new API
clone-connection (which takes over from new-pgsql-connection).
2015-10-20 22:15:55 +02:00
Dimitri Fontaine
187565b181 Add read/write separate stats.
Add metrics to determine where the time is spent in the current
pgloader code, so that it's possible to then optimize away the batch
processing as we do it today.

Given the following extract of the measures, it seems that doing the
data transformations in the reader thread isn't so bright an idea. More
to come.

          table name         total time       read     write
   -----------------     --------------  --------- ---------
             extract             2.014s
         before load             0.050s
               fetch             0.000s
   -----------------     --------------  --------- ---------
    geolite.location            16.090s    15.933s    5.732s
      geolite.blocks            28.896s    28.795s    5.312s
   -----------------     --------------  --------- ---------
          after load            37.772s
   -----------------     --------------  --------- ---------
   Total import time          1m25.082s    44.728s   11.044s
2015-10-11 21:35:19 +02:00
Dimitri Fontaine
96a33de084 Review the stats and reporting code organisation.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that
the output buffer is not subject to race conditions; extend its use to
also deal with statistics messages.

In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.

Also, as a nice side effect of the code simplification and refactoring
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
2015-10-05 01:46:29 +02:00
Dimitri Fontaine
a0dc59624c Fix schema qualified table names usage again.
When the list of columns of the PostgreSQL target table isn't given in
the load command, pgloader will happily query the system catalogs to get
that information. The list-columns query didn't get the memo about the
qualified table name format and the with-schema macro... fix #288.
2015-09-11 11:53:28 +02:00
Victor Kryukov
792db2fcf4 Better regexp for PG SQL identifiers 2015-09-04 01:15:45 +02:00
Victor Kryukov
54997be2dd Fix #182: properly quote tables with . in their names 2015-09-04 01:15:29 +02:00
Dimitri Fontaine
eabfbb9cc8 Fix schema qualified table names usage (more).
When parsing table names in the target URI, we are careful to split
the table and schema names and store them into a cons in that case. Not
all source methods got the memo; clean that up.

See #182 and #186, a pull request I am now going to be able to accept.
Also see #287 that should be helped by being able to apply #186.
2015-09-04 01:06:15 +02:00
Dimitri Fontaine
ea35eb575d Implement --dry-run option, fix #264.
The dry run option will currently only check database connections, but
as that happens after having correctly parsed the load file, it also
allows checking that the command file is correct for the parser.

Note that the Lisp load-data API isn't subject to the dry-run method.

In passing, we add some more API entry points to the connection objects
and we should actually clean the code base to use the new QUERY generic
all over the place. It's for another patch tho.
2015-08-22 16:23:47 +02:00
Dimitri Fontaine
04ddf940d9 Left pad COPY octal chars with 0, fix #275.
The COPY TEXT format accepts non-printable characters as an escape
sequence wherein pgloader can pass in the octal number for the
character in its encoding. When doing that with small numbers like \6,
if the non-printable character is then followed by other digits it
becomes e.g. \646, which might not be part of the target encoding...

To fix, always left pad the character's octal number with zeroes, so
that we now send \00646, which COPY knows how to read: the char at
\006, then 4, then 6.

Also copy the test case over to pgloader and run it in the test suite.
2015-08-20 18:17:18 +02:00
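To illustrate against a hypothetical one-column table:

    -- \646 would be read as a single character (octal 646), which may
    -- not exist in the target encoding; left padded, \00646 is
    -- unambiguous: the char at \006, then the digits 4 and 6.
    COPY t (c) FROM stdin;
    \00646
    \.
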
Dimitri Fontaine
3a6120b931 Improve logging again.
The user experience is greatly enhanced by this little change, where you
know from the logs that pgloader could actually connect rather than
thinking it might still be trying...
2015-08-20 12:38:19 +02:00
Dimitri Fontaine
5e7e5391ef Fix the drop indexes option again, fix #251.
The index and constraint names given by PostgreSQL catalogs should not
be second-guessed; we need to just quote them. The identifier
downcasing is interesting when we get identifiers from other systems
for a migration, but is wrong when dropping existing indexes in
PostgreSQL.

Also, the :null special value from Postmodern was routing the code
badly; just transform it to nil manually when fetching the index list.
2015-07-26 15:38:15 +02:00
Dimitri Fontaine
3af99051d2 Fix the preserve index names option.
MySQL names its primary keys "PRIMARY" and we need to always uniquify
this name even when the user asked pgloader to preserve index names.

Also, the create-indexes-again function now needs to ask for index names
to be preserved specifically.
2015-07-18 23:39:32 +02:00
Dimitri Fontaine
54e29773d7 Fix index creation reporting, see #251.
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but failed to properly account for that fact in the
summary report with timings.

While fixing this, also fix the SQL used to re-establish the indexes
and associated constraints to allow for parallel execution; the ALTER
TABLE statements would otherwise block in ACCESS EXCLUSIVE MODE and
make our efforts vain.
2015-07-18 23:06:15 +02:00
Dimitri Fontaine
8511294ac7 Generalize index support to handle constraints, fix #251.
PostgreSQL rightfully forbids DROP INDEX when the index is used to
enforce a constraint, the proper SQL to issue is then ALTER TABLE DROP
CONSTRAINT. Also, in such a case pg_dump issues a single ALTER TABLE ADD
CONSTRAINT statement to restore the situation.

Have pgloader do the same with indexes that are used to back a constraint.
2015-07-17 17:06:09 +02:00
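A sketch of the difference, with illustrative names:

    -- DROP INDEX fails when the index backs a constraint:
    --   ERROR: cannot drop index foo_pkey because constraint
    --          foo_pkey on table foo requires it
    ALTER TABLE foo DROP CONSTRAINT foo_pkey;

    -- and a single statement restores it, as pg_dump does:
    ALTER TABLE foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
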
Dimitri Fontaine
4f2385fa4c Refactor code from previous commit.
The goal is to make it easy to add support for the 'drop indexes' option
in other source types (fixed, ixf, db3, file-based sources).
2015-07-16 19:35:34 +02:00
Dimitri Fontaine
49bf7e56f2 Implement a "drop indexes" option in CSV mode, fix #251.
When loading against a table that already has index definitions, the
load can be quite slow. The previous commit introduced a warning in
such a case. This commit introduces the option "drop indexes", which is
not used by default.

When this option is used, pgloader drops the indexes before loading the
data, then creates the indexes again with the same definitions as
before. All the indexes are created again in parallel to optimize
performance.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
2015-07-16 12:22:58 +02:00
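A sketch of the two-step dance for a primary key (names illustrative):

    -- the unique index builds in parallel with the other indexes:
    CREATE UNIQUE INDEX data_pkey ON data (id);

    -- then a quick catalog-only step promotes it to a constraint:
    ALTER TABLE data ADD CONSTRAINT data_pkey
          PRIMARY KEY USING INDEX data_pkey;
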
Dimitri Fontaine
7c834db6e3 Warn against pre-existing indexes.
Pre-existing indexes will reduce data loading performance, and it's
generally better to DROP the indexes prior to the load and CREATE them
again once the load is done. See #251 for an example of that.

In this patch we just add a WARNING against the situation; the next
patch will add support for a new WITH clause option allowing pgloader
to take care of the DROP/CREATE dance around the data loading.
2015-07-16 12:22:58 +02:00
Dimitri Fontaine
88c801997e Default to a static list of PostgreSQL keywords.
In some cases (such as when using a very old PostgreSQL instance or an
Amazon Redshift service, as in #255), the function pg_get_keywords()
does not exist, but we assume that pgloader might still be able to
complete its job.

We're better off with a static list of keywords than with an unhandled
error here, so let's see what happens next with Redshift.
2015-07-04 20:16:50 +02:00
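On servers that do provide the function, the keywords can be fetched
with something like this (a sketch, not necessarily pgloader's exact
query):

    SELECT word
      FROM pg_get_keywords()
     WHERE catcode = 'R';  -- 'R' marks reserved keywords
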
Alex Baretta
49dcae8068 Fix DROP TABLE statements on tables with foreign keys 2015-05-28 14:24:28 -07:00
Alex Baretta
0626d56303 Fix identifier case in FOREIGN KEY constraints 2015-05-25 18:08:37 -07:00