73 Commits

Dimitri Fontaine
9e574ce884 Rename web/ into docs/
This allows us to benefit from GitHub Pages without having to maintain a
separate orphan branch.
2016-08-19 20:55:29 +02:00
Dimitri Fontaine
a86a606d55 Improve existing PostgreSQL database handling.
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better data loading performance. Some of the indexes are
UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on them
in the PostgreSQL dependency tracking of the catalog.

We used to use the CASCADE option when dropping the indexes, which hid a
bug: if tables excluded from the load had foreign keys pointing to
tables we target, we would DROP those foreign keys because of the
CASCADE option, but then fail to install them again at the end of the load.

To prevent that from happening, pgloader now queries the PostgreSQL
pg_depend system catalog to list the “missing” foreign keys and adds them
to our internal catalog representation, from which we know to DROP then
CREATE the SQL objects at the proper times.
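
For illustration, here is a minimal sketch (not pgloader's actual query)
of how pg_depend can be used to list the foreign keys that depend on a
table's unique indexes; the table name is made up:

  SELECT c.conname,
         c.conrelid::regclass AS referencing_table,
         pg_get_constraintdef(c.oid) AS definition
    FROM pg_depend d
    JOIN pg_constraint c ON c.oid = d.objid AND c.contype = 'f'
    JOIN pg_index i ON i.indexrelid = d.refobjid
   WHERE d.classid = 'pg_constraint'::regclass
     AND d.refclassid = 'pg_class'::regclass
     AND i.indrelid = 'public.target_table'::regclass;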

See #400; this was an oversight in the fix for that issue.
2016-08-10 22:02:06 +02:00
Dimitri Fontaine
70572a2ea7 Implement support for existing target databases.
Also known as the ORM case: other tools are sometimes used to create the
target schema. In that case pgloader's job is to fill in the existing
target tables with the data from the source tables.

We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.
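
In SQL terms, the cycle looks roughly like this hedged sketch (table,
index, and constraint names are made up):

  ALTER TABLE order_lines DROP CONSTRAINT order_lines_order_id_fkey;
  ALTER TABLE orders DROP CONSTRAINT orders_pkey;
  DROP INDEX orders_created_at_idx;

  COPY orders FROM stdin;  -- bulk load without index maintenance

  ALTER TABLE orders ADD CONSTRAINT orders_pkey PRIMARY KEY (id);
  CREATE INDEX orders_created_at_idx ON orders (created_at);
  ALTER TABLE order_lines ADD CONSTRAINT order_lines_order_id_fkey
    FOREIGN KEY (order_id) REFERENCES orders (id);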

This behavior is activated when using the “create no tables” option as
in the following test-case setup:

  with create no tables, include drop, truncate

Fixes #400, for which I got a test-case to play with!
2016-08-06 20:19:15 +02:00
Dimitri Fontaine
42c8012e94 Cleanup: remove the now unused file! 2016-08-05 11:39:16 +02:00
Dimitri Fontaine
2aedac7037 Improve our internal catalog representation.
First, add indexes and foreign keys to the list of objects supported by
the shared catalog facility; they were previously only found in the
pgsql schema-specific package, for historical reasons.

Then also add to our internal catalog structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.

Now that we have a proper and complete catalog, review the pgsql
module's DDL output functions in terms of the catalog, and rewrite the
schema creation support so that it directly benefits from our internal
catalog representation.

In passing, clean up the code organisation of the pgsql target support
module to make it easier to work with.

The next step consists of getting rid of src/pgsql/queries.lisp: this
facility should be replaced by a target catalog that we fetch the usual
way, thanks to the new src/pgsql/pgsql-schema.lisp file and its
list-all-* functions.

That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by tools other than pgloader,
that is, when migrating with the help of an ORM. See #400 for details.
2016-08-01 23:14:58 +02:00
Dimitri Fontaine
5e18cfd7d4 Implement support for partial indexes.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.

The WHERE clauses in both cases are limited to simple expressions over a
single row of the base table, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the
clause in PostgreSQL slang.

In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. a MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:

  MS SQL:     "([deleted]=(0))"  (that's from the catalogs)
  PostgreSQL: deleted = 'f'
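
Spelled out as complete DDL, the translation amounts to something like
this sketch (table and index names are hypothetical):

  -- MS SQL filtered index on the source:
  --   CREATE INDEX ix_live ON dbo.customer (name) WHERE ([deleted]=(0));
  -- PostgreSQL partial index on the target:
  CREATE INDEX ix_live ON customer (name) WHERE deleted = 'f';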

Of course the parser is still only lightly tested; let's see what
happens in the wild now.

(Should) Fix #365.
2016-03-21 23:39:45 +01:00
Dimitri Fontaine
c724018840 Implement ALTER TABLE clause for MySQL migrations.
The new ALTER TABLE facility makes it possible to act on tables found in
the MySQL database before the migration happens. In this patch the only
provided actions are RENAME TO and SET SCHEMA, which fixes #224.
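
A hedged sketch of what the clause looks like in a load command (the
table names are made up, and the exact syntax of the time may differ):

  ALTER TABLE NAMES MATCHING 'mv_sales' RENAME TO 'sales',
  ALTER TABLE NAMES MATCHING ~/./ SET SCHEMA 'legacy'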

In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema struct so that the schema slot of
our table instances is a schema instance rather than its name.

Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
2016-03-06 21:51:33 +01:00
Dimitri Fontaine
782561fd4e Handle default value transforms errors, fix #333.
It turns out that the MySQL catalog always stores default values as
strings, even when the column itself is of type bytea. In some cases,
it's then impossible to transform the expected bytea from a string.

In passing, move some code around to fix dependencies and make it
possible to issue log warnings from the default value printing code.
2016-02-03 12:27:58 +01:00
Dimitri Fontaine
aa8b756315 Fix when to create indexes.
In the recent refactoring and parallelism improvements, index creation
could kick in before the data was done being copied over to the target
table.

Fix that by maintaining a writers-count hashtable and only starting to
create indexes when that count reaches zero, meaning all the concurrent
tasks started to handle the COPY of the data are now done.
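
A hedged sketch of the counting idea, with illustrative names only
(create-indexes stands in for the real machinery):

  (defvar *writers-count* (make-hash-table :test 'equal))

  (defun create-indexes (table)            ; placeholder for this sketch
    (format t "~&creating indexes for ~a~%" table))

  (defun register-writer (table)
    (incf (gethash table *writers-count* 0)))

  (defun mark-writer-done (table)
    ;; only create indexes once the last COPY task for the table is done
    (when (zerop (decf (gethash table *writers-count*)))
      (create-indexes table)))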
2016-01-16 19:50:21 +01:00
Dimitri Fontaine
8a596ca933 Move connection into utils.
There's no reason why this file should live at the src/ top level.
2016-01-07 16:42:43 +01:00
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code between the different source types, finally
have a go at the quite horrible mess of anonymous data structures
floating around.

Having catalog and schema instances not only allows for code cleanup,
but will also make it possible to implement some bug fixes and wishlist
items such as mapping tables from one schema to another.

Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function, plus
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to the start of the data copy step, from
which point the copy class instances are all that's used.

This might be refactored again in a follow-up patch.
2015-12-30 21:53:01 +01:00
Dimitri Fontaine
cca44c800f Simplify batch and transformation handling.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.

Also review the organisation of responsibilities in the code, allowing
queue.lisp to move into utils/batch.lisp, renaming it as its scope has
been reduced to only caring about preparing batches.

This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that's left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
2015-11-29 17:35:25 +01:00
Dimitri Fontaine
e23de0ce9f Improve SQLite table names filtering.
Filter the list of tables we migrate directly in the SQLite query,
avoiding returning useless data. To do that, use the LIKE pattern
matching supported by SQLite, as the REGEXP operator is apparently only
available when extra extensions are loaded.
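
For instance, a hedged sketch of such a filtering query against SQLite's
catalog (the actual query pgloader runs may differ):

  SELECT tbl_name
    FROM sqlite_master
   WHERE type = 'table'
     AND tbl_name NOT LIKE 'sqlite!_%' ESCAPE '!';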

See #310 where filtering out the view still caused errors in the
loading.
2015-11-22 22:10:26 +01:00
Dimitri Fontaine
41e9eebd54 Rationalize common generic API implementation.
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that
in some cases we don't need the extra level of flexibility: each change
of the API had to be systematically propagated to all the specific
methods, so just use a single generic definition where possible.

In particular, introduce a new intermediate class for COPY subclasses,
making it possible to share more common code in the method
implementations rather than having to copy/paste and maintain several
versions of the same code.

It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
2015-10-05 21:25:21 +02:00
Dimitri Fontaine
7b9b8a32e7 Move sexp parsing into its own file.
After all, it's shared between the CSV command parsing and the Cast
Rules parsing. src/parsers/command-csv.lisp still contains lots of
facilities shared between the file-based sources and will need another
series of splits.
2015-10-05 11:39:44 +02:00
Dimitri Fontaine
96a33de084 Review the stats and reporting code organisation.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that
the output buffer is not subject to race conditions; extend its use to
also deal with statistics messages.

In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.

Also, as a nice side effect of the code simplification and refactoring,
this fixes #283, wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
2015-10-05 01:46:29 +02:00
Dimitri Fontaine
eabfbb9cc8 Fix schema qualified table names usage (more).
When parsing table names in the target URI, we are careful to split the
schema and table names and store them in a cons in that case. Not all
source methods got the memo; clean that up.

See #182 and #186, a pull request I am now going to be able to accept.
Also see #287, which should be helped by being able to apply #186.
2015-09-04 01:06:15 +02:00
Dimitri Fontaine
56a89e9b53 Cleanup schema data structure building.
As reported by the clisp maintainer (thanks jackdaniel!) when trying to
load pgloader, we had redundant labels function names in places. Get rid
of those by pushing newly found columns directly onto the end of the
list, avoiding the bulky code that then reversed the complex anonymous
data structure.

The Real Fix™ would be to define proper structures in which to hold all
those database catalog representations, but that's an invasive patch and
now isn't a good time to write it.

At least pgloader should load and run with clisp now.
2015-08-15 23:54:45 +02:00
Dimitri Fontaine
d1fce3728a Allow more PostgreSQL URI options, fix #199.
As per the PostgreSQL documentation on connection strings, allow
overriding the main URI components in the options part, with a
percent-encoded syntax for parameters. This makes it possible to bypass
the main URI parser's limitations as seen in #199 (how do you have a
password start with a colon?).
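
For example, a password starting with a colon can be percent-encoded in
the options part of the URI (hypothetical credentials):

  postgresql://user@localhost:5432/dbname?password=%3Asecret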

See:
 http://www.postgresql.org/docs/9.3/interactive/libpq-connect.html#LIBPQ-CONNSTRING
2015-05-22 23:39:04 +02:00
Dimitri Fontaine
55584406fa Add encoding support for db3 sources, fix #176.
It appears that db3 files are not limited to the ASCII character
encoding that they were designed with, so let's clue pgloader in about
that.

This commit builds on cl-db3 commit 770cbe3526, and the pgloader
Makefile has been updated to momentarily fetch cl-db3 from github rather
than Quicklisp so that it's possible to enjoy the new feature
immediately.
2015-02-18 22:40:03 +01:00
Dimitri Fontaine
cd46b6cbed Clean up common code for sources.
Only move code around, creating a src/sources/common directory with
several files in there so as to split the too-big src/sources.lisp.
2015-01-08 23:17:40 +01:00
Dimitri Fontaine
e1bc6425e2 Implement support for PostgreSQL COPY format, fix #145.
PostgreSQL COPY format is not really CSV but something way easier to
parse. Funnily enough, parsing it as CSV is not that easy, so we add
here a special simple parser for the COPY format.
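
As a reminder, the COPY text format uses a tab separator, backslash
escapes, and \N for NULL values, so a data file looks like this made-up
sample:

  1	Dimitri	\N
  2	pgloader	docs/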

It should also be quite useful to try loading rejected data files from
pgloader again after manually fixing them. It's still missing some
documentation, without any good excuse for that; will add some soon.
2015-01-02 18:49:17 +01:00
Dimitri Fontaine
302a7d402b Refactor connection handling, and clean-up many things.
That's the big refactoring patch I've been sitting on for too long.

First, refactor connection handling to use a uniform "connection"
concept (class and generic functions API) everywhere, so that the COPY
derived objects just use that in their :source-db and :target-db slots.

Given that, we no longer need to mess around with *pgconn* and *myconn-*
and other special variables anywhere in the tree.

Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when new API got into place.

Third, fix any other oddity or missing part found while doing those
first two activities, it was long overdue anyway...
2014-12-26 21:50:29 +01:00
Dimitri Fontaine
87e157bee2 Add a new database source type in the parser.
Now it's possible to parse a command to load data from MS SQL. The
parser was until now parsing all database URIs within the same common
rule, and that isn't possible anymore if we want to distinguish between
source databases right from the parser, which we actually want to do.

This patch also implements in-passing fixes all over the place,
including to the transformation function float-to-string, which only
happened to work on double-float data.
2014-11-17 00:23:06 +01:00
Dimitri Fontaine
fff756f95f Refactor the command parser.
Split its content into separate files, so that each is easier to
maintain, and to make it easier to add support for new sources.
2014-11-16 22:22:04 +01:00
Dimitri Fontaine
03bba5f486 Some more SQL Server support (schema conversion).
Converting the table definitions (with type casting) seems to work. Also
experimented a little with actually fetching some data... and had to
edit the cl-mssql driver, which is temporarily monkey-patched.
2014-11-10 01:16:10 +01:00
Dimitri Fontaine
ca325ba799 Refactor the SQLite source files. 2014-11-09 22:59:30 +01:00
Dimitri Fontaine
6473a892d4 First steps toward MS SQL compatibility. 2014-11-09 00:09:42 +01:00
Dimitri Fontaine
ed853a7bea Allow pgloader to work on windows. 2014-11-06 22:12:20 +01:00
Dimitri Fontaine
3c334dcdc4 Refactor the main parser to use the bind macro.
The metabang-bind lib offers a nice bind macro that solves the problem
of ignoring bindings in destructuring-bind, and allows a let* approach
to nested destructuring (even when mixed with let declarations).
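
For instance, a small hedged example of the style bind enables (not code
from this patch):

  (bind (((a _ b) '(1 2 3))             ; destructure a list, ignoring a slot
         ((:values q r) (floor 10 3)))  ; bind multiple values at once
    (list a b q r))                     ; => (1 3 3 1)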

Using that lib (that we already indirectly depend on anyway) simplifies
the parser code substantially.
2014-10-02 17:05:35 +02:00
Dimitri Fontaine
7cf7e714fc Implement the source date format option. 2014-10-02 01:03:24 +02:00
Dimitri Fontaine
2369a142a7 Refactor source code organisation.
In passing, fix a bug in the previous commit where left-over code would
cancel the whole new parsing code for advanced source field options.
2014-10-01 23:20:24 +02:00
Dimitri Fontaine
422f87e912 We don't use the zip system anymore. 2014-09-10 22:19:59 +02:00
Dimitri Fontaine
3e0526c957 Implement early support for IXF files. 2014-07-14 21:53:50 +02:00
Dimitri Fontaine
807f5cefcd Fix omitted file dependency (reading queries from file). 2014-06-16 14:24:05 +02:00
Dimitri Fontaine
c3742a9410 Typo fix cl-base64 system's name, fix the fix for #60. 2014-05-16 23:36:45 +02:00
Dimitri Fontaine
9e12035ca1 Review SQLite blob types in light of "manifest typing", fix #60.
When using SQLite 3, a blob column might return either string or byte
vector values dynamically, depending on the data itself or maybe on some
more complex parameters controlled at data insert time.

Hard-code the rule that a blob column returned as a string is in fact
base64 encoded (which looks like common practice) and decode it
automatically when needed, before sending it to byte-vector-to-bytea. It
might be a tad slow but at least the data is properly converted.
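
The decoding rule amounts to something like this hedged sketch, assuming
the cl-base64 library (maybe-decode-blob is a made-up name):

  (defun maybe-decode-blob (value)
    ;; SQLite may hand a blob back as a string: per the rule above,
    ;; assume base64 and decode it to a byte vector
    (if (stringp value)
        (base64:base64-string-to-usb8-array value)
        value))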

In future, that decision might come back to byte us again, at which
point it'll be necessary to consider full casting options as in the
MySQL CAST rules. It seems like a big enough win for now if we can avoid
that.
2014-05-16 23:13:57 +02:00
Dimitri Fontaine
35ca4927e9 Get rid of some lib dependencies.
The charset business isn't worth depending on an AGPL-licensed lib that
is part of a huge Quicklisp system.
2014-04-25 17:21:11 +02:00
Dimitri Fontaine
4d6def8105 Move some MySQL old import/export functions apart... 2014-03-04 13:52:48 +01:00
Dimitri Fontaine
db947e1467 Rework reader and writer data exchange.
With this patch, the whole data massaging and final formatting into the
PostgreSQL COPY TEXT format is done by the reader thread, which
publishes one batch at a time to the communication channel: a
lparallel.queue object.

Before that, the raw vectors were pushed directly into the queue,
offering more flexibility to adjust to the reader and writer IO rates
and capabilities, but impeding the Garbage Collector: data still in the
queue was not collected even when not needed anymore.

The new model also uses less memory and allows better control over how
much data stays in memory. The new *concurrent-batches* parameter
should be key to being able to process huge rows.
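
A hedged sketch of the bounded-queue idea with lparallel (illustrative,
not the actual pgloader code):

  (defparameter *concurrent-batches* 10)

  ;; a bounded queue blocks the reader once it is full, keeping at most
  ;; *concurrent-batches* batches in memory at any time
  (defvar *channel*
    (lparallel.queue:make-queue :fixed-capacity *concurrent-batches*))

  (lparallel.queue:push-queue (vector "row" "data") *channel*) ; reader side
  (lparallel.queue:pop-queue *channel*)                        ; writer side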

The intent is to offer a way for users to tune *concurrent-batches* down
to 1 for sources with a massive per-row memory footprint. Even better
would be to find a way to automatically adjust the setting without
spending too much time counting the bytes we're batching.

Preliminary tests show no sensible impact on performance from this
patch, and even some improvements in some cases.
2014-01-25 23:54:49 +01:00
Dimitri Fontaine
a51a712b6a Fix asd dependencies, clean up useless and misplaced compilation options. 2014-01-21 14:37:26 +01:00
Dimitri Fontaine
2080d91e40 Fix dependency declarations in between files, should help with #19. 2014-01-02 23:48:57 +01:00
Dimitri Fontaine
17b366ca82 Create a website to present the software. 2014-01-02 23:25:23 +01:00
Dimitri Fontaine
b2c9e0d2dc Refactor the whole logging infrastructure not to depend on threads sharing streams. 2013-12-24 19:08:55 +01:00
Dimitri Fontaine
f02eb641b4 Switch from cl-mysql to qmynd, an all-lisp driver for MySQL. 2013-12-03 22:05:39 +01:00
Dimitri Fontaine
3486cc688f Looks like I forgot to add fixed.lisp in the asd system definitions. 2013-11-08 21:50:40 +01:00
Dimitri Fontaine
5ce5d53d7d Use trivial-backtrace to display more useful information in case of unexpected events, hopefully. 2013-11-07 20:14:06 +01:00
Dimitri Fontaine
6a75187b7d Refactor MySQL to use the new API. 2013-11-04 19:16:08 +01:00
Dimitri Fontaine
0a38195853 Refactoring the API with a real definition of it, and reorg the source tree. 2013-11-04 13:21:45 +01:00
Dimitri Fontaine
50114a0d3a Hack-in some support for SQLite data source, including some refactoring preps. 2013-10-24 00:21:46 +02:00