pgloader

mirror of https://github.com/dimitri/pgloader.git synced 2026-02-11 09:21:01 +01:00

Author	SHA1	Message	Date
Dimitri Fontaine	150d288d7a	Improve our regression testing facility. Next parallelism improvements will allow pgloader to use more than one COPY thread to load data, with the impact of changing the order of rows in the database. Rather than doing a copy out and `diff` of the data just loaded, load the reference data and do the diff in SQL: select * from loaded.data except select * from expected.data If such a query returns any row, we know we didn't load what was expected and the regression test is failing. This regression testing facility should also allow us to finally add support for multiple-table regression tests (sqlite, mysql, etc).	2015-11-17 17:03:08 +01:00
Dimitri Fontaine	6cbec206af	Turns out SSL key/crt file paths should be strings. Our PostgreSQL driver uses CFFI to load the SSL support from OpenSSL and as a result the certificate and key file names should be strings rather than pathnames. Should fix #308 again...	2015-11-11 23:10:29 +01:00
Dimitri Fontaine	f8ae9f22b9	Implement support for SSL client certificates. This fixes #308 by automatically using the PostgreSQL Client Side SSL files as documented in the following reference: http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE This uses the Postmodern special support for it. Unfortunately couldn't test it locally other than it doesn't break non-ssl connections. Pushing to have user feedback.	2015-11-09 11:32:17 +01:00
Dimitri Fontaine	4df3167da1	Introduce another worker thread: transformers. We used to have a reader and a writer cooperating concurrently to load the data from the source to PostgreSQL. The transformation of the data was then the responsibility of the reader thread. Measurements showed that the PostgreSQL processes were mostly idle, waiting for the reader to produce data fast enough. In this patch we introduce a third worker thread that is responsible for processing the raw data into pre-formatted batches, allowing the reader to focus on extracting the data only. We now have two lparallel queues involved in the processing, the raw queue contains the vectors of raw data directly, and the processed-queue contains batches of properly encoded strings for the COPY text protocol. On the test laptop the performance gain isn't noticeable yet, it might be that we need much larger data sets to see a gain here. At least the setup isn't detrimental to performance on smaller data sets. Next improvements are going to allow more features: specialized batch retry thread and parallel table copy scheduling for database sources. Let's also continue caring about performance and play with having several worker and writer threads for each reader. In later patches. And some day, too, we will need to make the number of workers a user defined variable rather than something hard coded as today. It's on the todo list, meanwhile, dear user, consider changing the (make-kernel 6) into (make-kernel 12) or something else in src/sources/mysql/mysql.lisp, and consider enlightening me with whatever it is you find by doing so!	2015-10-23 00:17:58 +02:00
Dimitri Fontaine	4f3b3472a2	Cleanup and timing display improvements. Have each thread publish its own start-time so that the main thread may compute time spent in source and target processing, in order to fix the crude hack of taking (max read-time write-time) in the total time column of the summary. We still have some strange artefacts here: we consider that the full processing time is bound to the writer thread (:target), because it needs to have the reader done already to be able to COPY the last batch... but in testing I've seen some :source timings higher than the :target ones... Let's solve problems one at a time tho, I guess multi-threading and accurate wall clock times aren't to be expected to mix and match that easily anyway (multi cores, single RTC and all that).	2015-10-22 22:36:40 +02:00
Dimitri Fontaine	1fb69b2039	Retry connecting to PostgreSQL in some cases. Now that we can setup many concurrent threads working against the PostgreSQL database, and before we open the number of workers to our users, install a heuristic to manage the PostgreSQL error classes “too many connections” and “configuration limit exceeded” so that pgloader waits for some time (retry-connect-delay) then tries connecting again. It's quite simplistic but should cover lots of border-line cases way more nicely than just throwing the interactive debugger at the end user.	2015-10-20 23:15:05 +02:00
Dimitri Fontaine	633067a0fd	Allow more parallelism in database migrations. The newly added statistics are showing that read+write times are not enough to explain how long we wait for the data copying, so it must be the workers setup rather than the workers themselves. From there, let lparallel work its magic in scheduling the work we do in parallel in pgloader: rather than doing blocking receive-result calls for each table, only receive-result at the end of the whole copy-database processing. On test data here on the laptop we go from 6s to 3s to migrate the sakila database from MySQL to PostgreSQL: that's because we have lots of very small tables, so the cost of waiting after each COPY added up quite quickly. In passing, stop sharing the same connection object in between parallel workers that used to be controlled active in-sequence, see the new API clone-connection (which takes over new-pgsql-connection).	2015-10-20 22:15:55 +02:00
Dimitri Fontaine	187565b181	Add read/write separate stats. Add metrics to devise where the time is spent in current pgloader code so that it's possible to then optimize away the batch processing as we do it today. Given the following extract of the measures, it seems that doing the data transformations in the reader thread isn't so bright an idea. More to come. table name total time read write ----------------- -------------- --------- --------- extract 2.014s before load 0.050s fetch 0.000s ----------------- -------------- --------- --------- geolite.location 16.090s 15.933s 5.732s geolite.blocks 28.896s 28.795s 5.312s ----------------- -------------- --------- --------- after load 37.772s ----------------- -------------- --------- --------- Total import time 1m25.082s 44.728s 11.044s	2015-10-11 21:35:19 +02:00
Dimitri Fontaine	96a33de084	Review the stats and reporting code organisation. In order to later be able to have more worker threads sharing the load (multiple readers and/or writers, maybe more specialized threads too), have all the stats be managed centrally by a single thread. We already have a "monitor" thread that gets passed log messages so that the output buffer is not subject to race conditions, extend its use to also deal with statistics messages. In the current code, we send a message each time we read a row. In some future commits we should probably reduce the messaging here to something like one message per batch in the common case. Also, as a nice side effect of the code simplification and refactoring this fixes #283 wherein the before/after sections of individual CSV files within an ARCHIVE command were not counted in the reporting.	2015-10-05 01:46:29 +02:00
Dimitri Fontaine	a0dc59624c	Fix schema qualified table names usage again. When the list of columns of the PostgreSQL target table isn't given in the load command, pgloader will happily query the system catalogs to get that information. The list-columns query didn't get the memo about the qualified table name format and the with-schema macro... fix #288.	2015-09-11 11:53:28 +02:00
Victor Kryukov	792db2fcf4	Better regexp for PG SQL identifiers	2015-09-04 01:15:45 +02:00
Victor Kryukov	54997be2dd	Fix 182: properly quote tables with . in their names	2015-09-04 01:15:29 +02:00
Dimitri Fontaine	eabfbb9cc8	Fix schema qualified table names usage (more). When parsing table names in the target URI, we are careful of splitting the table and schema name and store them into a cons in that case. Not all sources methods got the memo, clean that up. See #182 and #186, a pull request I am now going to be able to accept. Also see #287 that should be helped by being able to apply #186.	2015-09-04 01:06:15 +02:00
Dimitri Fontaine	ea35eb575d	Implement --dry-run option, fix #264 . The dry run option will currently only check database connections, but as that happens after having correctly parsed the load file, it allows to also check that the command file is correct for the parser. Note that the list load-data API isn't subject to the dry-run method. In passing, we add some more API entry points to the connection objects and we should actually clean the code base to use the new QUERY generic all over the place. It's for another patch tho.	2015-08-22 16:23:47 +02:00
Dimitri Fontaine	04ddf940d9	Left pad COPY octal chars with 0, fix #275 . The COPY TEXT format accepts non printable characters with an escaped sequence wherein pgloader can pass in the octal number for the character in its encoding. When doing that with small numbers like \6 and the non-printable character is then followed by other numbers, then it becomes e.g. \646 which might not be part of the target encoding... To fix, always left pad the character octal number with zeroes, so that we now send in \00646 which COPY knows how to read: the char at \006 then 4 then 6. Also copy the test case over to pgloader and run it in the test suite.	2015-08-20 18:17:18 +02:00
Dimitri Fontaine	3a6120b931	Improve logging again. The user experience is greatly enhanced by this little change, where you know from the logs that pgloader could actually connect rather than thinking it might be still trying...	2015-08-20 12:38:19 +02:00
Dimitri Fontaine	5e7e5391ef	Fix the drop indexes option again, fix #251 . The index and constraint names given by PostgreSQL catalogs should not be second guessed, we need to just quote them. The identifier down casing is interesting when we get identifiers from other systems for a migration, but is wrong when dropping existing indexes in PostgreSQL. Also, the :null special value from Postmodern was routing the code badly, just transform it manually to nil when fetching the index list.	2015-07-26 15:38:15 +02:00
Dimitri Fontaine	3af99051d2	Fix the preserve index names option. MySQL names its primary keys "PRIMARY" and we need to always uniquify this name even when the user asked pgloader to preserve index names. Also, the create-indexes-again function now needs to ask for index names to be preserved specifically.	2015-07-18 23:39:32 +02:00
Dimitri Fontaine	54e29773d7	Fix index creation reporting, see #251 . The new option 'drop indexes' reuses the existing code to build all the indexes in parallel but failed to properly account for that fact in the summary report with timings. While fixing this, also fix the SQL used to re-establish the indexes and associated constraints to allow for parallel execution, the ALTER TABLE statements would block in ACCESS EXCLUSIVE MODE otherwise and make our efforts vain.	2015-07-18 23:06:15 +02:00
Dimitri Fontaine	8511294ac7	Generalize index support to handle constraints, fix #251 . PostgreSQL rightfully forbids DROP INDEX when the index is used to enforce a constraint, the proper SQL to issue is then ALTER TABLE DROP CONSTRAINT. Also, in such a case pg_dump issues a single ALTER TABLE ADD CONSTRAINT statement to restore the situation. Have pgloader do the same with indexes that are used to back a constraint.	2015-07-17 17:06:09 +02:00
Dimitri Fontaine	4f2385fa4c	Refactor code from previous commit. The goal is to make it easy to add support for the 'drop indexes' option in other source types (fixed, ixf, db3, file-based sources).	2015-07-16 19:35:34 +02:00
Dimitri Fontaine	49bf7e56f2	Implement a "drop indexes" option in CSV mode, fix #251 . When loading against a table that already has index definitions, the load can be quite slow. Previous commit introduced a warning in such a case. This commit introduces the option "drop indexes" that is not used by default. When this option is used, pgloader drops the indexes before loading the data then creates the indexes again with the same definitions as before. All the indexes are created again in parallel to optimize performance. Only primary key indexes can't be created in parallel, so those are created in two steps (create unique index then alter table).	2015-07-16 12:22:58 +02:00
Dimitri Fontaine	7c834db6e3	Warn against pre-existing indexes. Pre-existing indexes will reduce data loading performances and it's generally better to DROP the index prior to the load and CREATE them again once the load is done. See #251 for an example of that. In that patch we just add a WARNING against the situation, the next patch will also add support for a new WITH clause option allowing to have pgloader take care of the DROP/CREATE dance around the data loading.	2015-07-16 12:22:58 +02:00
Dimitri Fontaine	88c801997e	Default to a static list of PostgreSQL keywords. In some cases (such as when using a very old PostgreSQL instance or an Amazon Redshift service, as in #255), the function pg_get_keywords() does not exist but we assume that pgloader might still be able to complete its job. We're better off with a static list of keywords than with an unhandled error here, so let's see what happens next with Redshift.	2015-07-04 20:16:50 +02:00
Alex Baretta	49dcae8068	Fix DROP TABLE statements on tables with foreign keys	2015-05-28 14:24:28 -07:00
Alex Baretta	0626d56303	Fix identifier case in FOREIGN KEY constraints	2015-05-25 18:08:37 -07:00
Dimitri Fontaine	53dcdfd8ef	Fix handling of COPY data, fix #222 . When given a file in the COPY format, we should expect that its content is already properly escaped as expected by PostgreSQL. Rather than unescape the data then escape it again, add a new mode of operation to format-vector-row in which it won't even try to reformat the data. In passing, fix an off-by-one bug in dealing with non-ascii characters.	2015-04-30 13:17:02 +02:00
Dimitri Fontaine	0068a45e1c	Fix parsing of qualified target table names, see #186 . We used to parse qualified table names as a simple string, which then breaks attempts to be smart about how to quote identifiers. Some sources are known to accept dots in quoted table names and we need to be able to process that properly without tripping on qualified table names too late. Current code might not be the best approach as it's just using either a cons or a string for table names internally, rather than defining a proper data structure with a schema and a name slot. Well, that's for a later cleanup patch, I happen to be lazy tonight.	2015-04-17 23:22:30 +02:00
Dimitri Fontaine	7d2d09ce68	Add the option to preserve MySQL index names, fix #187 . See test/parse/hans.goeuro.load for an example usage of the new option. In passing, any error when creating indexes is now properly reported and logged, which was missing previously. Oops.	2015-03-07 20:19:47 +01:00
Dimitri Fontaine	48f451bdbc	Implement the option to disable triggers when loading data. This option is dangerous and allows to skip ALL triggers when loading data against PostgreSQL. This includes foreign key constraints definitions and will allow loading data out of order. When using both the options "create no table" and "disable triggers" it will be possible to load data into a schema prepared by your favorite external tool, at the cost of not validating FK constraints. Use with care. Fix #167.	2015-02-19 15:05:10 +01:00
Dimitri Fontaine	47288d2818	Fix whitespace and indentation.	2015-02-19 10:30:42 +01:00
Victor Kryukov	c38ef4c235	Make quoting identifiers more robust: do not quote already quoted string, and double quotes when quoting. Fix #180 .	2015-02-19 10:26:44 +01:00
Dimitri Fontaine	100e942c22	Force quoting of some identifiers, per pg rules, fix #161 . PostgreSQL requires that an identifier begin with letters or underscore only, so an identifier that begins with a digit must be quoted. In the current coding pgloader will unnecessarily quote some identifiers that begin with a unicode accented letter, but that's only cosmetic and isn't worth worrying about (famous last words).	2015-01-29 10:40:12 +01:00
Dimitri Fontaine	ce5a61face	Catch PostgreSQL internal errors too, fixes #155 .	2015-01-21 13:01:28 +01:00
Dimitri Fontaine	9f45b9864a	Implement support for update and delete rules for MySQL FKeys, fixes #141 .	2015-01-14 18:35:48 +01:00
Dimitri Fontaine	e1bc6425e2	Implement support for PostgreSQL COPY format, fix #145 . PostgreSQL COPY format is not really CSV but something way easier to parse. Funnily enough, parsing it as CSV is not that easy, so we add here a special simple parser for the COPY format. It should be quite useful to try loading reject data files from pgloader again after manual fixing, too. It's still missing some documentation without any good excuse for that, will add soon.	2015-01-02 18:49:17 +01:00
Dimitri Fontaine	b3a09a20e3	Further simplifications from stassats.	2014-12-26 22:33:15 +01:00
Dimitri Fontaine	302a7d402b	Refactor connection handling, and clean-up many things. That's the big refactoring patch I've been sitting on for too long. First, refactor connection handling to use a uniform "connection" concept (class and generic functions API) everywhere, so that the COPY derived objects just use that in their :source-db and :target-db slots. Given that, we don't need no messing around with pgconn and myconn- and other special variables at all anywhere in the tree. Second, clean up some oddities accumulated over time, where some parts of the code didn't get the memo when new API got into place. Third, fix any other oddity or missing part found while doing those first two activities, it was long overdue anyway...	2014-12-26 21:50:29 +01:00
Dimitri Fontaine	9c6f78bff0	Stop debugging set-table-oids.	2014-12-22 17:16:42 +01:00
Dimitri Fontaine	5b726e47a0	Improve error reporting on connection error.	2014-12-19 14:24:35 +01:00
Dimitri Fontaine	47c22776f2	Cleanup and fix yesterday's refactoring of pgconn parameters.	2014-12-17 11:35:10 +01:00
Dimitri Fontaine	87f8e2392e	Cleanup a double pg-setting setting. That is now taken care of in the main query macros.	2014-12-16 18:46:07 +01:00
Dimitri Fontaine	073f012d1a	Add support for SSL modes in the PG connection string, fix #137 . In passing, refactor the *pgconn- dynamic bindings in favor of directly using the connection property list straight from the connection string parser, processing it when necessary. That allows to make it simple to add an internal :use-ssl property.	2014-12-16 18:45:43 +01:00
Dimitri Fontaine	d4cab3a81e	Fix MSSQL index and foreign key names. First, the index names in MS SQL, as in MySQL, are only unique per table, whereas they need to be globally unique (per database) in PostgreSQL. So reuse the infrastructure we had for MySQL here. Second, the way we trick table names in index and fkey structures means that we already did quote the names and we don't want to quote them again, so add a new possible identifier-case value to handle the case where nothing is to be done, pretty please.	2014-11-25 00:42:37 +01:00
Dimitri Fontaine	f263d1b2a4	Implement Foreign Key support for MSSQL. Piggyback as much as possible on the work already done for MySQL.	2014-11-24 23:42:19 +01:00
Dimitri Fontaine	e72325ee25	Improve Index build concurrency. Rather than doing ALTER TABLE directly, use CREATE UNIQUE INDEX in the all in parallel concurrent index build per table, and only in the end game "upgrade" that unique index into a PRIMARY KEY using ALTER TABLE. The reason why it's a good idea to do that is to avoid an ACCESS EXCLUSIVE LOCK at ALTER TABLE time, which is killing our index build concurrency.	2014-11-24 22:05:22 +01:00
Dimitri Fontaine	5b87b1a85e	Refactor identifier-case option into a dynamic binding. That makes it much easier to use from about anywhere in the code, which is what is needed. In passing, fix #129.	2014-11-21 23:32:02 +01:00
Dimitri Fontaine	ed853a7bea	Allow pgloader to work on windows.	2014-11-06 22:12:20 +01:00
Dimitri Fontaine	be9abe48fe	Cleanup some pgsql connection handling.	2014-09-10 22:20:20 +02:00
Neil Gentleman	16d3edf5be	check for reserved keyword after downcasing. Uppercase USER isn't reserved, but lowercase is	2014-08-08 18:25:06 -07:00
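
Commit 150d288d7a above replaces a copy-out-and-`diff` check with a SQL `except` query, so that row order no longer matters. The following is a minimal sketch of that idea, not pgloader's test harness: it assumes Python's standard `sqlite3` module standing in for PostgreSQL, and the table names `loaded_data` and `expected_data` and the helper `regression_diff` are hypothetical.

```python
import sqlite3


def regression_diff(conn, loaded, expected):
    """Return rows found in `loaded` that are absent from `expected`."""
    # Table names are interpolated directly: acceptable for fixed test
    # fixtures only, never for untrusted input.
    query = f"select * from {loaded} except select * from {expected}"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table loaded_data   (id integer, name text);
    create table expected_data (id integer, name text);
    -- Same rows, different insertion order.
    insert into loaded_data   values (2, 'b'), (1, 'a');
    insert into expected_data values (1, 'a'), (2, 'b');
    """
)

# Row order does not matter to EXCEPT, so the test passes here.
same_rows = regression_diff(conn, "loaded_data", "expected_data")

# An unexpected extra row makes the query return something.
conn.execute("insert into loaded_data values (3, 'c')")
extra_rows = regression_diff(conn, "loaded_data", "expected_data")
```

Note that `except` only detects rows in the loaded table that are missing from the reference; catching rows that failed to load would need the same query with the two tables swapped.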
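
Commit 4df3167da1 above describes a reader, a transformer and a writer connected by two queues. Here is a rough sketch of that layout, assuming Python's `threading` and `queue` modules in place of lparallel; the function names, the `DONE` sentinel, the batch size and the list used as a stand-in for the COPY stream are all illustrative assumptions.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream (an assumption)


def reader(rows, raw_queue):
    """Only extract rows from the source and hand them over."""
    for row in rows:
        raw_queue.put(row)
    raw_queue.put(DONE)


def transformer(raw_queue, processed_queue, batch_size):
    """Turn raw rows into batches of tab-separated text lines."""
    batch = []
    while True:
        row = raw_queue.get()
        if row is DONE:
            break
        batch.append("\t".join(str(col) for col in row))
        if len(batch) == batch_size:
            processed_queue.put(batch)
            batch = []
    if batch:  # flush the last, possibly partial, batch
        processed_queue.put(batch)
    processed_queue.put(DONE)


def writer(processed_queue, sink):
    """Consume pre-formatted batches; `sink` stands in for COPY."""
    while True:
        batch = processed_queue.get()
        if batch is DONE:
            break
        sink.extend(batch)


def run_pipeline(rows, batch_size=2):
    raw_queue = queue.Queue(maxsize=100)
    processed_queue = queue.Queue(maxsize=100)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(rows, raw_queue)),
        threading.Thread(
            target=transformer, args=(raw_queue, processed_queue, batch_size)
        ),
        threading.Thread(target=writer, args=(processed_queue, sink)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink


loaded = run_pipeline([(1, "a"), (2, "b"), (3, "c")])
```

With a single transformer and a single writer the row order is preserved; adding more writers, as the later commits plan, is what makes the order of loaded rows unpredictable.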
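
Commit 04ddf940d9 above fixes an ambiguity in octal escapes for the COPY text format. The sketch below illustrates the problem and the zero-padding fix; it is not pgloader's code, the function names are hypothetical, and it only covers ASCII control characters and the backslash.

```python
def escape_copy_char(ch: str) -> str:
    """Escape one character for the COPY text format (sketch only)."""
    if ch == "\\":
        return "\\\\"
    code = ord(ch)
    if code < 32 or code == 127:
        # Always emit three octal digits so that digits following the
        # escape cannot be absorbed into it.
        return "\\{:03o}".format(code)
    return ch


def escape_copy_text(value: str) -> str:
    return "".join(escape_copy_char(ch) for ch in value)


# Without padding, character 6 followed by "46" reads as the escape \646.
unpadded = "\\{:o}".format(6) + "46"

# With padding, the same input is unambiguous: \006, then "4", then "6".
padded = escape_copy_text("\x06" + "46")
```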
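
Commits c38ef4c235 and 100e942c22 above describe three quoting rules: leave already quoted identifiers alone, double any embedded double quote, and quote identifiers that do not start with a letter or underscore. A small sketch of those rules follows; the regular expression and the function name are assumptions made for illustration, and reserved keywords are deliberately not handled here.

```python
import re

# Lower-case ASCII identifiers that PostgreSQL accepts unquoted.
# This pattern is an illustrative assumption, not pgloader's regexp.
PLAIN_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_$]*")


def quote_identifier(name: str) -> str:
    """Quote `name` only when needed, without quoting it twice."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name  # already quoted, leave untouched
    if PLAIN_IDENTIFIER.fullmatch(name):
        return name  # safe to use as-is
    # Double embedded quotes, then wrap the whole name in quotes.
    return '"' + name.replace('"', '""') + '"'


examples = {
    name: quote_identifier(name)
    for name in ["users", "2015_stats", '"Already"', 'we"ird', "my.table"]
}
```

Under these assumptions an identifier starting with a digit such as `2015_stats` gets quoted, and a table name containing a dot is quoted as a single identifier rather than split into schema and table.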
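
Commits e72325ee25 and 49bf7e56f2 above build primary keys in two steps: a unique index first, then an ALTER TABLE that upgrades it. The sketch below generates statements of that shape. The exact SQL pgloader emits is not shown in the log, so the use of PostgreSQL's `ADD PRIMARY KEY USING INDEX` form is an assumption, as are the helper name and the sample table.

```python
def primary_key_statements(table, index_name, columns):
    """Return the two statements of a two-step primary key build."""
    column_list = ", ".join(columns)
    # Step 1: can run alongside the other index builds on the table.
    create = f"CREATE UNIQUE INDEX {index_name} ON {table} ({column_list});"
    # Step 2: run once the index exists, so the ALTER TABLE itself
    # has no index left to build.
    upgrade = f"ALTER TABLE {table} ADD PRIMARY KEY USING INDEX {index_name};"
    return [create, upgrade]


statements = primary_key_statements("film", "film_pkey", ["film_id"])
```

Identifiers are interpolated unquoted to keep the sketch short; real code would quote them first.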

102 Commits