When building from sources within the git environment, the version
number is ok, but it was wrong when building in the docker image. Fix
the version number to 3.3.0.50 to show that we're talking about a
development snapshot that is leading to version 3.3.1.
Yeah, 4-part version numbers. That happens, apparently.
Apparently it's quite common nowadays for people to use docker to build
and run software in a contained way, so provide users with the facility
they need in order to do that.
Following up on the recent refactoring effort, the IXF and DB3 source
classes didn't get the memo that they could piggyback on the generic
copy-database implementation. This patch implements that.
In passing, also simplify the instanciate-table-copy-object method for
copy subclasses that need specialization here, by using change-class and
call-next-method so as to reuse the generic code as much as possible.
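For illustration, the CLOS idiom in question, sketched here with a
hypothetical IXF specializer (the real method signatures may differ):

    (defmethod instanciate-table-copy-object ((copy copy-ixf) table)
      ;; build the generic copy object first, then switch its class so
      ;; that the IXF specific methods apply to it from now on
      (change-class (call-next-method) 'copy-ixf))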
In the previous refactoring patch, that option mistakenly went away,
although it is still needed for MS SQL and it is planned to make use of
it in the other source types too...
See #316 for reference.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also make it possible to implement some bug fixes and wishlist
items, such as mapping tables from one schema to another.
Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-database methods for database source
types, so that they all use a fetch-metadata generic function along with
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.
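A sketch of the resulting shared control flow, with hypothetical class
and helper names standing in for the real ones:

    (defmethod copy-database ((copy db-copy) &key &allow-other-keys)
      (let ((catalog (fetch-metadata copy)))     ; introspect the source
        (cast catalog)                           ; apply the casting rules
        (prepare-pgsql-database copy catalog)    ; create schemas, tables
        (copy-data copy catalog)                 ; the actual COPY streaming
        (complete-pgsql-database copy catalog))) ; indexes, constraints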
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case,
and in passing fix some call sites of apply-identifier-case.
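A minimal sketch of one way to ensure that with lparallel, assuming
kernel creation is the right place to capture the parent thread's
binding (dynamic bindings are thread-local otherwise):

    (setf lparallel:*kernel*
          (lparallel:make-kernel 6
                                 :bindings `((*pgsql-reserved-keywords*
                                              . ,*pgsql-reserved-keywords*))))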
Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
When the option "WITH no foreign keys" is in use, it's not necessary to
go read the foreign key information_schema bits at all, so just don't
issue the query, and same thing with the "create no indexes" option.
In not-that-old versions of MySQL, the referential_constraints table of
information_schema doesn't exist, so this should make pgloader
compatible with MySQL 5.0-something and earlier.
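A sketch of the idea, with hypothetical option and function names:

    ;; only pay for the information_schema queries when their result is
    ;; actually going to be used
    (when create-indexes
      (setf indexes (list-all-indexes copy)))
    (when foreign-keys
      (setf fkeys (list-all-fkeys copy)))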
It turns out the summary write times included time spent waiting for
batches to be ready, which isn't fair to the PostgreSQL COPY
implementation, and moreover doesn't help figuring out the bottlenecks...
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.
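A sketch of the transformation worker loop, with hypothetical queue and
function names:

    (loop for batch = (lparallel.queue:pop-queue raw-queue)
          until (eq batch :end-of-data)
          do (lparallel.queue:push-queue
              (map 'vector #'prepare-row-for-copy batch)
              processed-queue))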
Also review the organisation of responsibilities in the code, making it
possible to move queue.lisp into utils/batch.lisp, renaming it as its
scope has been reduced to only caring about preparing batches.
This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that's left as of now is this
cleanup, which seems to be on the faster side of things, and that's
always good.
We used to wait for the wrong number of workers, meaning the rest of the
code began running before the indexes were all available. A user report
where one of the indexes takes a very long time to compute made it
obvious.
In passing, also improve reporting of those rendez-vous sections.
It became pretty obvious that the error counting was broken: it happens
that I forgot to pass the information down to the state-handling parts
of the code.
In passing, improve and fix CSV parse error counting and fatal error
reporting.
In some cases, like when client_min_messages is set to debug5,
PostgreSQL might send notification messages to the connecting client
even while opening a connection. Those are still considered WARNINGs by
the Postmodern driver...
Handle those warnings by just printing them out in the pgloader logs,
rather than considering those conditions as hard failures (signaling a
db-connection-error).
The test/errors.load file sets the search_path to include the 'err' schema,
which is to be created by the test itself. PostgreSQL 9.1 raises an
error where 9.4 and following just accept the setting, and Travis runs a
9.1 PostgreSQL.
Let's just create the schema beforehand so that we can still run tests
against SET search_path from the load file.
Filter the list of tables we migrate directly from the SQLite query,
avoiding returning useless data. To do that, use the LIKE pattern
matching supported by SQLite; the REGEXP operator is apparently only
available when extra features are loaded.
See #310 where filtering out the view still caused errors in the
loading.
It's been proven by Andres Freund's benchmarks that the best number of
parallel COPY threads concurrently active against a single table is 2 as
of the current PostgreSQL development version (up to 9.5 stable; it
might still apply to 9.6 depending on how we might solve the problem).
Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.
Finally, also improve the MySQL setup to use 8 threads by default, that
is, be able to load two tables concurrently, each with 2 COPY workers, a
reader, and a transformer thread.
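The arithmetic behind that default, as a sketch:

    ;; 2 tables in flight x (1 reader + 1 transformer + 2 COPY workers)
    ;; gives 8 threads for the lparallel kernel
    (lparallel:make-kernel 8)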
It's all still experimental as far as performance goes; next patches
should bring the capability to configure the desired parallelism from
the command line and the load command tho.
Also, other source types will want to benefit from the same
capabilities; it just happens that it's easier to play with MySQL first
for several reasons here.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.
Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:
    select * from loaded.data
    except
    select * from expected.data
If such a query returns any row, we know we didn't load what was
expected and the regression test is failing.
This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).
Our PostgreSQL driver uses CFFI to load the SSL support from OpenSSL,
and as a result the certificate and key file names should be strings
rather than pathnames. Should fix #308 again...
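For instance, converting a pathname to the kind of string the CFFI
layer expects (the path here is just an example):

    CL-USER> (uiop:native-namestring #P"/home/user/.postgresql/postgresql.crt")
    "/home/user/.postgresql/postgresql.crt"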
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:
http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE
This uses Postmodern's special support for it. Unfortunately I couldn't
test it locally beyond checking that it doesn't break non-SSL
connections. Pushing to get user feedback.
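For reference, a sketch of what the wiring might look like, assuming
these cl-postgres special variables are the Postmodern support in
question (file paths are examples):

    ;; values are plain strings, per the previous patch
    (setf cl-postgres:*ssl-certificate-file* "/home/user/.postgresql/postgresql.crt"
          cl-postgres:*ssl-key-file*         "/home/user/.postgresql/postgresql.key")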
That allows the copy-column-list specific method for MySQL to be a
method of the common pgloader.sources::copy-column-list generic
function, and then to be called again when needed.
This fixes an oversight in #41e9eeb and fixes #132 again.
Trying to provide a new one fails with an error that I missed because I
must have forgotten to `make clean` when adding the previous #+ccl
variant here...
This alone doesn't fix building with CCL, but it already improves the
situation as reported in #303. The next failure is something I fail to
understand tonight:
    Fatal SIMPLE-ERROR:
      Compilation failed: In MAKE-DOUBLE-FLOAT: Type declarations violated
      in (THE FIXNUM 4294967295) in
      /Users/dim/dev/pgloader/build/quicklisp/local-projects/qmynd/src/common/utilities.lisp
When using Clozure Common Lisp, a :absolute directory component given
to make-pathname is apparently supposed to contain a single path
component; fix by using parse-native-namestring instead.
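For instance:

    CL-USER> (uiop:parse-native-namestring "/tmp/pgloader/")
    #P"/tmp/pgloader/"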
In case it's needed, the following spelling seems portable enough:
    CL-USER> (uiop:merge-pathnames*
               (uiop:make-pathname* :directory '(:relative "pgloader"))
               (uiop:make-pathname* :directory '(:absolute "tmp")))
    #P"/tmp/pgloader/"
We used to have a reader and a writer cooperating concurrently to load
the data from the source to PostgreSQL. The transformation of the
data was then the responsibility of the reader thread.
Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.
In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing, the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.
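A sketch of the two-queue setup, with hypothetical names and
capacities; a bounded queue provides back-pressure, so the reader
can't run arbitrarily far ahead of the other two threads:

    (defparameter *raw-queue*
      (lparallel.queue:make-queue :fixed-capacity 128))
    (defparameter *processed-queue*
      (lparallel.queue:make-queue :fixed-capacity 128))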
On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.
Next improvements are going to allow more features: specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.
And some day, too, we will need to make the number of workers a
user-defined variable rather than something hardcoded as today. It's on
the todo list; meanwhile, dear user, consider changing the (make-kernel 6)
into (make-kernel 12) or something else in src/sources/mysql/mysql.lisp,
and consider enlightening me with whatever it is you find by doing so!
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.
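A sketch of the idea, with a hypothetical worker-start-time accessor:

    ;; in each worker thread: publish the start time
    (setf (worker-start-time worker) (get-internal-real-time))

    ;; in the main thread: seconds spent by that worker so far
    (/ (- (get-internal-real-time) (worker-start-time worker))
       internal-time-units-per-second)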
We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...
Let's solve problems one at a time tho, I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multi cores, single RTC and all that).
As seen in #302 it's possible to define a SQLite column of type "integer
auto_increment". In my testing tho, it doesn't mean a thing. Worse than
that, apparently when an integer column is created that is also used as
the primary key of the table, the notation "integer auto_increment
primary key" disables the rowid behavior that is certainly expected.
Let's not yet mark the bug as fixed as I suppose we will have to do
something about this rowid mess. Thanks again SQLite.
When calling lparallel:receive-results from the main thread we lose
the ability to measure proper per-table processing times, because it all
happens in parallel and the main thread seems to receive events in the
same millisecond as when the worker is started, meaning it's all 0.0s.
So when we don't have "secs" stats, pick the greatest of read or write
time, which we do have from the worker threads themselves.
The number are still wrong, but less so than the "0.0s" displayed before
this patch.
Now that we can set up many concurrent threads working against the
PostgreSQL database, and before we open the number of workers to our
users, install a heuristic to manage the PostgreSQL error classes “too
many connections” and “configuration limit exceeded” so that pgloader
waits for some time (*retry-connect-delay*) then tries connecting again.
It's quite simplistic but should cover lots of border-line cases way
more nicely than just throwing the interactive debugger at the end user.
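A sketch of the heuristic; the open-connection entry point and the
retry-forever policy are simplifications here:

    (loop
       (handler-case (return (open-connection pgconn))
         (cl-postgres:database-error (e)
           ;; SQLSTATE 53300 is too_many_connections, 53400 is
           ;; configuration_limit_exceeded
           (if (member (cl-postgres:database-error-code e)
                       '("53300" "53400") :test #'string=)
               (sleep *retry-connect-delay*)
               (error e)))))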
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers' setup rather than the workers themselves.
From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.
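A sketch of the scheduling change, with a hypothetical copy-table task
function and table list:

    (let ((channel (lparallel:make-channel)))
      ;; submit one task per table first, then drain the channel only
      ;; once everything has been submitted
      (loop for table in all-tables
            do (lparallel:submit-task channel #'copy-table copy table))
      (loop repeat (length all-tables)
            do (lparallel:receive-result channel)))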
On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.
In passing, stop sharing the same connection object between parallel
workers that used to run one after the other; see the new API
clone-connection (which takes over from new-pgsql-connection).
Add metrics to see where the time is spent in current pgloader code
so that it's possible to then optimize away the batch processing as we
do it today.
Given the following extract of the measurements, it seems that doing the
data transformations in the reader thread isn't so bright an idea. More
to come.
       table name      total time       read      write
-----------------  --------------  ---------  ---------
          extract          2.014s
      before load          0.050s
            fetch          0.000s
-----------------  --------------  ---------  ---------
 geolite.location         16.090s    15.933s     5.732s
   geolite.blocks         28.896s    28.795s     5.312s
-----------------  --------------  ---------  ---------
       after load         37.772s
-----------------  --------------  ---------  ---------
Total import time       1m25.082s    44.728s    11.044s