MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.
The WHERE clause in both cases is limited to simple expressions over a
base table's row at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the
clause in PostgreSQL slang.
In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. an MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:
MS SQL: "([deleted]=(0))" (that's from the catalogs)
PostgreSQL: deleted = 'f'
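To make the idea concrete, here is a minimal sketch of that kind of
rewrite; it only handles the simple "([column]=(literal))" shape, and
both the function and the regular expression are illustrative, not
pgloader's actual filter parser (cl-ppcre is assumed to be available):

    ;; rewrite "([deleted]=(0))" into "deleted = 'f'", reusing a CAST
    ;; transform on the literal
    (defun translate-simple-filter (ms-filter &key (transform #'identity))
      (cl-ppcre:register-groups-bind (column literal)
          ("\\(\\[(\\w+)\\]=\\((\\w+)\\)\\)" ms-filter)
        (format nil "~a = '~a'" column (funcall transform literal))))

    ;; (translate-simple-filter
    ;;   "([deleted]=(0))"
    ;;   :transform (lambda (bit) (if (string= bit "0") "f" "t")))
    ;; => "deleted = 'f'"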
Of course the parser is still very badly tested; let's see what happens
in the wild now.
(Should) Fix #365.
The implementation uses the dynamic binding *on-error-stop* so it's also
available when pgloader is used as a Common Lisp library.
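For instance, a library caller could bind it around the load, along the
following lines; this is only a sketch, and both the package holding
*on-error-stop* and the entry point being called are assumptions rather
than the documented API:

    ;; sketch only: bind *on-error-stop* around whatever pgloader entry
    ;; point you call as a library; the package and the entry point name
    ;; are assumptions
    (let ((pgloader.params:*on-error-stop* t))
      (run-the-load))   ; placeholder for the real pgloader entry point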
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!
Fix #85.
Once more we can't use an aggregate over a text column in MS SQL to
build the index definition from its catalog structure, so we have to do
that in the lisp part of the code.
Multi-column indexes are now supported, but filtered indexes are still a
problem: the WHERE clause in MS SQL is not compatible with the
PostgreSQL syntax (because of [names] and type casting).
For example we cast MS SQL bit to PostgreSQL boolean, so
WHERE ([deleted]=(0))
should be translated to
WHERE not deleted
And the code to do that is not included yet.
The following documentation page offers more examples of WHERE
expressions we might want to support:
https://technet.microsoft.com/en-us/library/cc280372.aspx
WHERE EndDate IS NOT NULL
  AND ComponentID = 5
  AND StartDate > '01/01/2008'
WHERE EndDate IN ('20000825', '20000908', '20000918')
It might be worth automating the translation to PostgreSQL syntax and
operators, but it's not done in this patch.
See #365, where the created index will now be as follows, which is a
problem because it is UNIQUE: some existing data won't reload fine.
CREATE UNIQUE INDEX idx_<oid>_foo_name_unique ON dbo.foo (name, type, deleted);
Having been given a test instance of an MS SQL database made it possible
to quickly fix a series of assorted bugs related to schema handling of
MS SQL databases. As it's the only source with a proper notion of schema
that pgloader currently supports, it's not a surprise we had them.
Fix #343. Fix #349. Fix #354.
The new ALTER TABLE facility makes it possible to act on tables found in
the MySQL database before the migration happens. In this patch the only
provided actions are RENAME TO and SET SCHEMA, which fixes #224.
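As a rough illustration of what the two actions do to a table entry,
assuming a table is represented as a (schema . name) cons; the function
name and the representation are mine, not pgloader's internals:

    ;; illustrative only: apply one ALTER TABLE action to a
    ;; (schema . name) cons before the migration starts
    (defun apply-alter-table-action (table action value)
      (ecase action
        (:rename-to  (cons (car table) value))
        (:set-schema (cons value (cdr table)))))

    ;; (apply-alter-table-action '("sakila" . "film") :set-schema "public")
    ;; => ("public" . "film")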
In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
It turns out that the MySQL catalog always stores default values as
strings, even when the column itself is of type bytea. In some cases,
it's then impossible to transform the expected bytea from a string.
In passing, move some code around to fix dependencies and make it
possible to issue log warnings from the default value printing code.
The decision to use lots of different packages in pgloader has quite
strong downsides at times, and the manual management of dependencies is
one of them, in particular how to avoid circular ones.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.
To that effect, tweak again the md-connection and md-copy
implementations.
In the recent refactoring and improvements of parallelism, index
creation would kick in before we knew that the data was done being
copied over to the target table.
Fix that by maintaining a writers-count hashtable and only starting to
create indexes when that count reaches zero, meaning all the concurrent
tasks started to handle the COPY of the data are now done.
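A minimal sketch of that counting scheme follows; the names are
illustrative, the counter is assumed to have been set to the number of
COPY tasks for each table beforehand, and a real implementation would
also protect the decrement with a lock:

    ;; each finished COPY task decrements its table's counter; the task
    ;; that brings it to zero is the one that triggers index creation
    (defvar *writers-count* (make-hash-table :test 'equal))

    (defun mark-writer-done (table-name create-indexes-fn)
      (when (zerop (decf (gethash table-name *writers-count*)))
        (funcall create-indexes-fn table-name)))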
pgloader's parallel workload is still hardcoded, but at least the code now
uses clear parameters as input so that it will be possible in a later
patch to expose them to the end-user.
The notions of workers and concurrency are now handled as follows:
- concurrency is how many tasks are allowed to happen at once: by
  default we have a reader thread, a transformer thread and a COPY
  thread all active for each table being loaded,
- worker-count is how many parallel threads are allowed to run
  simultaneously and currently defaults to 8, which means that in a
  typical migration from a database source, and given a default
  concurrency of 1 (3 threads), we might be loading up to 3 different
  tables at any time, as the small sketch below illustrates.
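The arithmetic behind that last point, as a tiny illustration rather
than actual pgloader code:

    ;; 3 threads (reader, transformer, COPY) per table and per unit of
    ;; concurrency, so worker-count bounds how many tables can have at
    ;; least one active thread at any given time
    (defun max-tables-touched (worker-count concurrency)
      (ceiling worker-count (* 3 concurrency)))

    ;; (max-tables-touched 8 1) => 3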
The idea is to expose those settings to the user in the load file and as
command line options (such as --jobs) and see what it gives us. It
might, for example, help use more cores when loading a single CSV file.
As of this patch, there can still be only one reader thread, and the
number of transformer threads must be the same as the number of COPY
threads.
Finally, the user-defined projections for CSV-like files are now handled
in the transformation threads rather than in the reader thread...
Thanks to the Common Lisp character data type, it's easy for pgloader to
enforce always speaking to PostgreSQL in utf-8, and that's actually what
has been done from the beginning.
Now, without good reason for that, the first example of a SET clause
added to the docs was about how to set client_encoding, which should NOT
be done.
Fix that at the user level by removing the bad example from the docs and
adding a WARNING whenever the client_encoding is set to a known bad
value. It's a WARNING because we then simply force 'utf-8' anyway.
Also, completely review the format-vector-row function to avoid doing
double work with the Postmodern facilities we piggyback on. This was
done halfway through, and the utf-8 conversion was actually done twice.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also make it possible to implement some bug fixes and wishlist
items such as mapping tables from one schema to another.
Also, supporting database sources having a notion of "schema" (in
between "catalog" and "table") should get easier, including getting on
par with MySQL in the MS SQL support (materialized views have been asked
for already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function and a
prepare-pgsql-database and a complete-pgsql-database generic function.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn will call the specialized versions of
list-all-columns and friends implementations. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.
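In outline, the shared flow now reads roughly as follows; this is a
sketch with placeholder names for the parts not named above (copy-data
in particular), not the actual method body:

    ;; rough shape of the shared copy-database flow described above
    (defun copy-database-sketch (copy)
      (let ((catalog (fetch-metadata copy)))      ; source introspection
        (cast catalog)                            ; apply the CAST rules
        (prepare-pgsql-database copy catalog)     ; create schemas, tables
        (copy-data copy catalog)                  ; placeholder: per-table COPY
        (complete-pgsql-database copy catalog)))  ; indexes, fkeys, etc.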
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case
and, in passing, fix some call sites to apply-identifier-case.
Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.
Also review the organisation of responsibilities in the code, which
allows moving queue.lisp into utils/batch.lisp, renaming it as its scope
has been reduced to only caring about preparing batches.
This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that is left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.
Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:
select * from loaded.data
except
select * from expected.data
If such a query returns any row, we know we didn't load what was
expected and the regression test is failing.
This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:
http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE
This uses the Postmodern special support for it. Unfortunately I
couldn't test it locally beyond checking that it doesn't break non-ssl
connections. Pushing it out to gather user feedback.
That allows the copy-column-list specific method for MySQL to be a
method of the common pgloader.sources::copy-column-list generic
function, and then to be called again when needed.
This fixes an oversight in #41e9eeb and fixes #132 again.
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the data
was then the responsibility of the reader thread.
Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.
In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed queue contains batches of properly
encoded strings for the COPY text protocol.
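A stripped-down sketch of those two queues and the new transformer step,
using the lparallel.queue API; the row source and the COPY formatting
function are stand-ins:

    ;; raw rows go into one queue, COPY-ready strings into the other
    (defvar *raw-queue*       (lparallel.queue:make-queue))
    (defvar *processed-queue* (lparallel.queue:make-queue))

    ;; reader thread: only extracts rows
    (defun reader-sketch (rows)
      (map nil (lambda (row) (lparallel.queue:push-queue row *raw-queue*))
           rows)
      (lparallel.queue:push-queue :end-of-data *raw-queue*))

    ;; transformer thread: turns raw rows into COPY text strings
    (defun transformer-sketch (format-row-fn)
      (loop :for row := (lparallel.queue:pop-queue *raw-queue*)
            :until (eq row :end-of-data)
            :do (lparallel.queue:push-queue (funcall format-row-fn row)
                                            *processed-queue*)
            :finally (lparallel.queue:push-queue :end-of-data
                                                 *processed-queue*)))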
On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.
Next improvements are going to allow more features: a specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.
And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard coded as it is today.
It's on the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with whatever
it is you find by doing so!
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.
We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...
Let's solve problems one at a time though; I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multiple cores, a single RTC and all that).
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers' setup rather than the workers themselves.
From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.
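In lparallel terms the change looks roughly like the following sketch;
the table list, the kernel size and the copy-table function are
placeholders, and a real program would create the kernel once rather
than per call:

    ;; submit one task per table, then block only once at the very end
    (defun copy-all-tables-sketch (tables)
      (let* ((lparallel:*kernel* (lparallel:make-kernel 8))
             (channel (lparallel:make-channel)))
        (dolist (table tables)
          (lparallel:submit-task channel #'copy-table table))
        ;; previously we would block here once per table; now we only
        ;; collect the results after everything has been submitted
        (loop :repeat (length tables)
              :do (lparallel:receive-result channel))))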
On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.
In passing, stop sharing the same connection object in between parallel
workers that used to be kept active in sequence; see the new
clone-connection API (which takes over from new-pgsql-connection).
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that
in some cases we don't need the extra level of flexibility: each change
of the API has had to be replicated in all the specific methods, so just
use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations,
rather than having to copy/paste and maintain several versions of the
same code.
It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
Commit 598c860cf5 broke user-defined
casting rules by interning "precision" and "scale" in the
pgloader.user-symbols package: those symbols need to be found in the
pgloader.transforms package instead.
Luckily enough the infrastructure to do that was already in place for
cl:nil.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that get passed log messages so that the
output buffer is not subject to race conditions, extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring,
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
To be able to use "t" (or "nil") as a column name, pgloader needs to be
able to generate lisp code where those symbols are available. It's
simple enough in that a Common Lisp package that doesn't :use :cl
fulfills the condition, so intern user symbols in a specially crafted
package that doesn't :use :cl.
Now, we still need to be able to run transformation code that is using
the :cl package symbols and the pgloader.transforms functions too. In
this commit we introduce a heuristic to pick symbols either as functions
from pgloader.transforms or anything else in pgloader.user-symbols.
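A condensed sketch of that heuristic, assuming the pgloader.transforms
and pgloader.user-symbols packages exist; the function name is mine:

    ;; prefer an existing function from pgloader.transforms, otherwise
    ;; intern the name into the package that does not :use :cl
    (defun intern-user-symbol (name)
      (let ((transform (find-symbol (string-upcase name)
                                    :pgloader.transforms)))
        (if (and transform (fboundp transform))
            transform
            (intern (string-upcase name) :pgloader.user-symbols))))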
And so that user code may use NIL too, we provide an override mechanism
to the intern-symbol heuristic and use it only when parsing user code,
not when producing Common Lisp code from the parsed load command.
When parsing table names in the target URI, we are careful to split the
table and schema name and to store them into a cons in that case. Not
all source methods got the memo; clean that up.
See #182 and #186, a pull request I am now going to be able to accept.
Also see #287 that should be helped by being able to apply #186.
When the notion of a connection class with a generic set of methods was
invented, the very flexible specification formats available for the
file-based sources were not integrated into the new connection system.
This patch provides a new connection class, md-connection, with a
specific sub-protocol (after opening a connection, the caller is
supposed to loop around open-next-stream) so that it's possible both to
properly fit into the connection concept and to better share the code in
between our three implementations (csv, copy, fixed).
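The calling convention, in outline; open-next-stream is assumed to
return nil once the file list is exhausted, the connection is assumed to
be already opened, and the per-file processing function is a stand-in:

    ;; loop shape of the md-connection sub-protocol: one connection,
    ;; then one stream per file in the specification
    (defun process-md-connection-sketch (md-conn process-one-stream)
      (loop :for stream := (open-next-stream md-conn)
            :while stream
            :do (funcall process-one-stream stream)))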
The dry run option will currently only check database connections, but
as that happens after having correctly parsed the load file, it also
allows checking that the command file is correct for the parser.
Note that the lisp load-data API isn't subject to the dry-run method.
In passing, we add some more API entry points to the connection objects
and we should actually clean the code base to use the new QUERY generic
all over the place. It's for another patch though.
As reported by the clisp maintainer (thanks jackdaniel!) when trying to
load pgloader, we had redundant labels function names in places. Get rid
of those by pushing the new columns found directly at the end of the
list, avoiding the bulky code needed to then reverse the complex
anonymous data structure.
The Real Fix™ would be to define proper structures in which to hold
those database catalog representations, but that's an invasive patch and
now isn't a good time to write it.
At least pgloader should load and run with clisp now.
The option doesn't seem relevant to the db3 source type which contains a
table definition: pgloader will create the table from scratch and no
indexes are going to be found.
When loading against a table that already has index definitions, the
load can be quite slow. The previous commit introduced a warning in such
a case. This commit introduces the "drop indexes" option, which is not
used by default.
When this option is used, pgloader drops the indexes before loading the
data, then creates the indexes again with the same definitions as
before. All the indexes are created again in parallel to optimize
performance.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
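For the primary key case, the two-step dance amounts to issuing
something like the following against PostgreSQL; this is a sketch, and
the helper that runs each SQL string is a stand-in:

    ;; build and run the two statements: a plain unique index first
    ;; (which can be created alongside the other indexes), then an
    ;; ALTER TABLE that promotes it to primary key
    (defun recreate-primary-key-sketch (run-sql table index-name columns)
      (funcall run-sql
               (format nil "CREATE UNIQUE INDEX ~a ON ~a (~{~a~^, ~});"
                       index-name table columns))
      (funcall run-sql
               (format nil "ALTER TABLE ~a ADD PRIMARY KEY USING INDEX ~a;"
                       table index-name)))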
Pre-existing indexes will reduce data loading performance, and it's
generally better to DROP the indexes prior to the load and CREATE them
again once the load is done. See #251 for an example of that.
In that patch we just add a WARNING against the situation; the next
patch will also add support for a new WITH clause option allowing
pgloader to take care of the DROP/CREATE dance around the data loading.
The database connection code needed to switch to the "new" connection
facilities, and there was a bug in the processing of template sections
wherein the template user would inherit the template property.
We used to parse qualified table names as a simple string, which then
breaks attempts to be smart about how to quote identifiers. Some sources
are known to accept dots in quoted table names and we need to be able to
process that properly without tripping on qualified table names too
late.
The current code might not be the best approach, as it's just using
either a cons or a string for table names internally, rather than
defining a proper data structure with a schema and a name slot.
Well, that's for a later cleanup patch; I happen to be lazy tonight.
It's now possible to have pgloader print out its summary in one of
several formats: human-readable (default), csv, copy or json. The
choice of format is made depending on the extension of the summary
filename picked on the command line with the option --summary.
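The extension dispatch is essentially this kind of decision; the
function is illustrative only, not pgloader's code:

    ;; pick the summary format from the file extension given to --summary
    (defun summary-format-from-filename (filename)
      (let ((type (string-downcase
                   (or (pathname-type (pathname filename)) ""))))
        (cond ((string= type "csv")  :csv)
              ((string= type "copy") :copy)
              ((string= type "json") :json)
              (t                     :human-readable))))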
Fix bugs related to parsing the new COPY type, and make it so that we
know how to parse options (and fields, and other type-dependent things)
even when --type is missing, in case the source URL has the information.
PostgreSQL COPY format is not really CSV but something way easier to
parse. Funnily enough, parsing it as CSV is not that easy, so we add
here a special simple parser for the COPY format.
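To give an idea of why it's easier: COPY text rows are tab-separated
with \N for NULL values, so a very naive line parser can be as small as
the following sketch, which ignores backslash escapes entirely:

    ;; split a COPY text line on tabs and map \N to a NULL marker;
    ;; real COPY data also needs backslash-escape handling
    (defun parse-copy-line-sketch (line)
      (loop :for start := 0 :then (1+ tab)
            :for tab := (position #\Tab line :start start)
            :collect (let ((field (subseq line start tab)))
                       (if (string= field "\\N") :null field))
            :while tab))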
It should also be quite useful to try loading pgloader reject data files
again after manual fixing. It's still missing some documentation,
without any good excuse for that; will add soon.
In passing, also allow --field to specify the whole field list: there's
no point in forcing the user to have as many --field switches on the
command line as they have columns in their data source file.
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a unified "connection"
concept (class and generic functions API) everywhere, so that the COPY
derived objects just use that in their :source-db and :target-db slots.
Given that, we no longer need any messing around with *pgconn* and
*myconn-* and other special variables anywhere in the tree.
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when new APIs got into place.
Third, fix any other oddity or missing part found while doing those
first two activities; it was long overdue anyway...
Make it so that the following command line usages are accepted when
using pgloader without a command file:
./build/bin/pgloader ./test/sqlite/sqlite.db postgresql:///pgloader
./build/bin/pgloader --set "search_path='sakila'"  \
                     mysql://root@localhost/sakila \
                     postgresql:///sakila
./build/bin/pgloader --type csv                         \
                     --field id --field field           \
                     --with truncate                     \
                     --with "fields terminated by ','"   \
                     ./test/data/matching-1.csv          \
                     postgres:///pgloader?matching
It's now possible in most cases to just use command-line options, which
should make the barrier to entry for pgloader much lower.
First, the index names in MS SQL, as in MySQL, are only unique per
table, whereas they need to be globally unique (per database) in
PostgreSQL. So reuse the infrastructure we had for MySQL here.
Second, the way we trick table names in index and fkey structures means
that we already quoted the names and we don't want to quote them again,
so add a new possible *identifier-case* value to handle the case where
nothing is to be done, pretty please.
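In other words, identifier handling grows a branch where the name is
passed through untouched. A sketch follows; the keyword values, and in
particular the one used for the new behaviour, are illustrative rather
than necessarily the ones used in the patch:

    ;; illustrative dispatch on the identifier case setting; the :none
    ;; branch is the new "already quoted, leave it alone" behaviour
    (defun apply-identifier-case-sketch (name identifier-case)
      (ecase identifier-case
        (:downcase (string-downcase name))
        (:quote    (format nil "\"~a\"" name))
        (:none     name)))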