The idea is for pgloader to tweak the schema from a description of the
sharding model: the distribute clause. Here's an example of such clauses:
distribute company using id
distribute campaign using company_id
distribute ads using company_id from campaign
distribute clicks using company_id from ads, campaign
Given such commands, pgloader adds the distribution key to the table when
needed, to the primary key definition of the table, and also to any foreign
keys pointing to the changed primary key.
Then when SELECTing the data from the source database, the idea is for
pgloader to automatically JOIN the base table with the source table where the
distribution key is to be found, in case the key was only just added to the
schema.
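For instance, to fetch the rows of the ads table along with their
distribution key, pgloader could issue a query such as this sketch, where
the join columns (id, campaign_id) are assumptions for the example:
SELECT campaign.company_id, ads.*
  FROM ads
       JOIN campaign ON ads.campaign_id = campaign.id;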
Finally, pgloader also calls the following Citus commands:
SELECT create_distributed_table('company', 'id');
SELECT create_distributed_table('campaign', 'company_id');
SELECT create_distributed_table('ads', 'company_id');
SELECT create_distributed_table('clicks', 'company_id');
The catalog queries used in pgloader have to be adjusted for Redshift,
because Redshift forked from PostgreSQL 8.0, which is a long time ago now.
Also, we had a couple of bugs here and there that were not really related to
Redshift support but showed up in that context.
Fixes #813.
When dealing with PostgreSQL protocol compatible databases, often enough
they don't support the same catalogs as PostgreSQL itself. Redshift for
instance lacks foreign key support.
We now accept the more general string and regex match rules, but the code to
generate including and excluding lists from the catalogs had not been updated.
It's now possible to use pgloader to migrate from PostgreSQL to PostgreSQL.
That might be useful for several reasons, including applying user defined
cast rules at COPY time, or just moving from one hosted solution to another.
It's a little more involved than what was done previously. In particular we
need to pay attention to MySQL varchar(x) columns and transform them into
something big enough when counting bytes rather than chars, like varchar(3x).
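As a sketch, a MySQL column defined as name varchar(10), counted in
characters, would be ported over to something like the following, with the
table name made up for the example:
CREATE TABLE example (name varchar(30));  -- 10 chars × 3 bytes, counted in bytes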
Then there's the "text" datatype to take into account, and some more.
Redshift looks like a very old PostgreSQL (8.0.2) with some extra features
and a very limited selection of data types. In this patch we parse the
PostgreSQL version() function output and automatically determine if we're
connected to Redshift.
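The version() output for Redshift embeds both the forked PostgreSQL version
and a Redshift marker, along the lines of the following (the exact build
details vary):
SELECT version();
-- PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC ..., Redshift 1.0.xxxx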
When connected to Redshift, we then dumb down our target catalogs to the
subset of data types that Redshift actually does support.
Also, some catalog queries can't be done in Redshift, and 8.0 didn't have a
fully compliant VALUES statement, so we use a temporary table in places
where we used to use SELECT ... FROM (VALUES(...)) in pgloader.
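As a sketch of that rewrite, with made-up column names:
-- the PostgreSQL spelling, not available in Redshift:
SELECT t.sname, t.tname
  FROM (VALUES ('pg_catalog', 'pg_class')) AS t(sname, tname);
-- the Redshift compatible equivalent, using a temporary table:
CREATE TEMP TABLE t (sname text, tname text);
INSERT INTO t VALUES ('pg_catalog', 'pg_class');
SELECT t.sname, t.tname FROM t;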
COPYing data to Redshift isn't possible with just this set of changes,
because Redshift also doesn't support the COPY FROM STDIN form. COPY sources
are limited, and another patch will have to be cooked to prepare the data
from pgloader into a format and location that Redshift knows how to handle.
At least, it's possible to migrate a database schema to Redshift already.
In data-only mode, the foreign keys parameter (which defaults to True) means
something special: we remove the fkey definitions prior to the data only
load, then re-install the fkeys.
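In SQL terms, and with hypothetical table and constraint names, that means
something like:
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
-- ... data only load with COPY ...
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
      FOREIGN KEY (customer_id) REFERENCES customers(id);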
This got broken in a previous commit, the WITH clause option being processed
like the other DDL ones that only make sense when creating the schema. While
fixing the setting in copy-database, we also have to fix a nesting bug in
complete-pgsql-database that would prevent the fkeys from being installed
again at the end of the load.
This patch not only fixes that choice, it also reviews the implementation of
the drop-pgsql-fkeys support function to use a more modern internal API,
preparing a list of SQL statements to be sent to the psql-execute level.
Fixes #745.
Several places in the code are involved in dealing with default values from
MS SQL. The catalog query copes with strange quoting rules on the source
side and used to fill in the PostgreSQL expected value directly. But then
the quoting of a function call wasn't properly handled.
Rather than coping with the quoting rules here, have the catalog query
return a pgloader specific placeholder "GENERATE_UUID". The MS SQL specific
code can then normalize that to the symbol :generate_uuid, and finally the
generic PostgreSQL DDL code can implement the proper replacement for that
symbol, without having to know where it comes from.
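As a sketch of the whole chain, with a hypothetical table and assuming
uuid-ossp's uuid_generate_v1() as the replacement expression:
-- MS SQL source default:       (newid())
-- catalog query now returns:   GENERATE_UUID
-- pgloader normalizes that to: :generate_uuid
-- generated PostgreSQL DDL:
CREATE TABLE dbo.account (id uuid DEFAULT uuid_generate_v1());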
Fix #742.
PostgreSQL understands both spellings of the data type name and implements
float as being a double precision value, so we should refrain from any
warning about that non-discrepancy when doing a data-only load.
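That equivalence is easy to check from psql:
SELECT 'float'::regtype;
-- double precision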
Should fix #746.
Namely the actions are “keep extra” and “drop extra” and the casting rule
guard is “with extra on update current timestamp”. Having support for those
elements in the casting rules allows a definition such as the following:
type timestamp with extra on update current timestamp
to "timestamp with time zone" drop extra
The effect of such a cast rule is to ignore the MySQL extra definition and
keep pgloader from creating the PostgreSQL triggers that implement the same
behavior.
Fix #735.
It might be that the schema exists but we didn't find what we expected to
in there, so that it didn't make it to pgloader's internal catalogs. Be
friendly to the user with a better error message.
Fix #713.
The copy format and batch facilities are no longer the meat of the
PostgreSQL support in the src/pgsql directory, so have them live in their
own space.
This function prepares the data to be sent down to PostgreSQL as a clean
COPY text with unicode handled correctly. This commit is mainly a clean-up
of the function, and also adds some smarts to try and make it faster.
In testing, the function is now slightly faster than before, but not by
much. The hope here is that it's now easier to optimize it.
Due to the way pgloader queries the PostgreSQL catalogs, it restricted
target tables to “ordinary” tables, as per the relkind description in the
https://www.postgresql.org/docs/current/static/catalog-pg-class.html
PostgreSQL documentation.
Extend this to support relkind of 'r', 'f' and 'p'.
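A minimal sketch of that change, with the real catalog query simplified
down to its relkind filter:
-- before:
SELECT c.oid, n.nspname, c.relname
  FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r';
-- after, also accepting foreign and partitioned tables:
SELECT c.oid, n.nspname, c.relname
  FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind in ('r', 'f', 'p');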
Fixes #587, fixes #690.
When doing a MySQL to PostgreSQL migration in data only mode, pgloader
matches schema names found on both source and target databases, and much
like with table names it must do so using unquoted schema names.
Otherwise, when using the “quote identifiers” option, we fail to find the
schema name again, because one spelling has the quotes and the other one
doesn't.
Fix #659, at least some forms of it.
In the next release, pgloader defaults to targeting a new schema named the
same as the MySQL database, because that's what makes the most sense. But
people are used to having 'public' in the search_path and everything in
there.
So when creating our target schema, when migrating from MySQL, arrange it so
that the new schema is in the search_path by issuing a command like:
ALTER DATABASE plop SET search_path TO public, f1db;
And make this command visible in verbose (NOTICE) mode too, so that users
can see what happens.
Fix #654. I think.
When using --verbose or more detailed log messages, the summary prints
timings for both read and write operations separately. The write summary
timing took into account only the PostgreSQL batch activity, discarding the
formatting of the data done by pgloader.
As this formatting is quite heavy at the moment, the results are pretty
misleading without that information.
The MySQL special syntax "on update current_timestamp()" used to support
only a single column per table (in MySQL), and so did pgloader. In MariaDB
version 10 it's now possible to have several columns with that special
treatment, so adapt pgloader to migrate that too.
What pgloader does is recognize that several columns are to receive the same
pre-update processing, and create a single function that handles all of
them, as in the following example, from pgloader logs in a test case:
CREATE OR REPLACE FUNCTION mysql.on_update_current_timestamp_onupdate()
  RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  NEW.update_date = now();
  NEW.calc_date = now();
  RETURN NEW;
END;
$$;

CREATE TRIGGER on_update_current_timestamp
  BEFORE UPDATE ON mysql.onupdate
  FOR EACH ROW
  EXECUTE PROCEDURE mysql.on_update_current_timestamp_onupdate();
Fixes #629.
PostgreSQL btree indexes are limited in the size of the values they can
index: values must fit in an index page (8kB). So when porting a MySQL full
text index over full documents, we might get into an error like the
following:
index row size 2872 exceeds maximum 2712 for index "idx_5199509_search"
To fix, query MySQL for the index type, which is FULLTEXT rather than BTREE
in those cases, and port it over to a PostgreSQL Full Text index with a
hard-coded 'simple' configuration, as in the following test case:
CREATE INDEX idx_75421_search ON mysql.fcm_batches USING gin(to_tsvector('simple', raw_payload));
Of course users might want to use a better configuration, including a
proper dictionary for the documents. When using PostgreSQL each document may
have its own configuration attached and yet they can all get indexed into
the same index, so that's a task for the application developers, not for
pgloader.
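Should the documents need different configurations, PostgreSQL allows
storing the configuration in a column of type regconfig and using it in the
index expression, as in this sketch where the ts_config column is an
assumption:
CREATE INDEX idx_search ON mysql.fcm_batches
    USING gin(to_tsvector(ts_config, raw_payload));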
In passing, fix the list-typenames-without-btree-support.sql query to return
separate entries for each index type rather than an {array,representation}
of the result, as Postmodern won't turn the PostgreSQL array into a Common
Lisp array by default. I'm left wondering how it worked before.
Fix #569.
Adjust the default value formatting to check if the default value is already
single-quoted, and only add new 'single quotes' when that's not the case.
Apparently ENUM default values in MariaDB 10 are now properly single quoted.
In the previous commit we introduced support for database names including
spaces, which means that by default pgloader creates a target schema in
PostgreSQL with a space in its name. That works well as soon as you always
double-quote the schema name, which pgloader does.
Now, in our internal catalogs, we keep the schema name double-quoted. And
when comparing that quoted schema name to the raw schema name from
PostgreSQL, they won't match, and pgloader tries to create the schema again:
ERROR Database error 42P06: schema "my sql" already exists
Fix the comparison to use unquoted schema names, and fix #614 again: the
previous fix would only work the first time.
It seems that SBCL still needs some help in deciding when to GC with very
large values. In a test case with a “data” column averaging 375 kB (up to
about 3 MB per datum), that help allows much larger batch size and prefetch
rows settings without entering the low-level debugger.
The pgstate infrastructure already had lots of details about what's going
on, add to it the information about how many bytes are sent in every batch,
and use this information in the monitor when something long is happening to
display how many rows we have sent since the beginning for this (supposedly)
huge table, along with bytes and speed (bytes per second).
The verbosity is not that easy to adjust. Remove useless messages and add a
new one telling when the COPY of a table is done. As we might have to wait
for some time for indexes being built, keep the CREATE INDEX lines. Also
keep the ALTER TABLE lines, both for primary keys and foreign keys, again
because the user might have to wait for quite some time.
The previous patch fixed CREATE TYPE so that ENUM types are created in the
same schema as the table using them, but failed to update the DROP TYPE
statements to also target this schema...
Before this patch, pgloader wouldn't care about which schema it
creates extra types in. Extra types are mainly ENUM and SET support from
MySQL. Now, pgloader creates those extra PostgreSQL ENUM types in the same
schema as the table using them, which is a more sound default.
We use a CSV parser for the MySQL enum values, but the quote escaping wasn't
properly set up: MySQL quotes ENUM values with a single-quote (') and uses
two of them ('') to escape single-quotes found in the ENUM value itself.
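For instance, a MySQL column definition embedding a single-quote in one of
its ENUM values looks like the following, with the quote escaped by doubling
it and the table name made up for the example:
CREATE TABLE example (mood enum('fine','don''t ask') not null);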
Fixes #597.
We already have apply-identifier-case and *identifier-case* to decide how
and when to quote our SQL object names, so don't force extra quotes in
format strings: refrain from using ~s.
We sure can trust PostgreSQL to use names it knows how to handle. Still, it
will be happy to store in its catalogs names containing upper case, and in
that case we must quote them.
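A quick psql illustration of why such names must then be quoted:
CREATE TABLE "Account" (id integer);
SELECT * FROM Account;      -- ERROR:  relation "account" does not exist
SELECT * FROM "Account";    -- ok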
This change was long overdue. Ideally we would use something like the YeSQL
library for Clojure, but it seems like the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...
So this patch introduces a URL abstraction built on top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as
(sql "pgsql/list-all-columns.sql")
in the source code directly.
So for now the templating system is CL's format language. It is still an
improvement over embedded strings. Again, one step at a time.
It might be that a column-type-name is actually an sqltype instance, and
then #'string= won't be happy. Prevent that now by discarding any smarts
when the type name does not satisfy stringp.
When pgloader fetches the index list from a source database, it doesn't
fetch information about access methods for the indexes: I don't even know if
the overlap in index access methods from one RDBMS to another covers more
than just btree...
It could happen that MySQL indexes a "geometry" column tho. This datatype is
converted automatically to "point" by pgloader, which is good. But the index
creation would fail with the following error message:
Database error 42704: data type point has no default operator class for access method "btree"
In this patch, when setting up the target schema, we issue a PostgreSQL
catalog query to dynamically list those datatypes without btree support and
fetch their opclasses, with a hard-coded preference for GiST, then GIN, so
as to be able to automatically use the proper access method when btree isn't
available. And now pgloader transparently issues the proper statement:
CREATE INDEX idx_168468_idx_location ON pagila.address USING gist(location);
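A minimal sketch of such a catalog query; the actual query shipped in
pgloader may differ:
select t.typname, am.amname, oc.opcname
  from pg_opclass oc
       join pg_am am on am.oid = oc.opcmethod
       join pg_type t on t.oid = oc.opcintype
 where oc.opcdefault
   and am.amname <> 'btree'
   and not exists (select 1
                     from pg_opclass o
                    where o.opcintype = t.oid
                      and o.opcmethod = (select oid from pg_am
                                          where amname = 'btree'))
 order by t.typname,
          -- hard-coded preference: GiST first, then GIN
          case am.amname when 'gist' then 0 when 'gin' then 1 else 2 end;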
Currently this exploration is limited to indexes with a single column. To
implement the general case we would need a more complex lookup: we would
have to find the intersection of all the supported access methods for all
involved columns.
Of course we might need to do that someday. One step at a time is plenty
good enough tho.
In the complete PostgreSQL schema step, an error would be logged as you
expect but poorly handled: it would have the whole transaction rolled back,
meaning that a single Primary Key definition failure would cancel all the
others, plus the foreign keys, and also the triggers and comments.
It happens that other systems allow a primary key column to contain NULL
values, which is forbidden by the standard and enforced by PostgreSQL, so
that's not a theoretical concern here.
In cases where pgloader needs to build a new identifier from existing
ones (mainly for renaming indexes, because they are unique per-table in the
source database and unique per-schema in PostgreSQL), and we compose the new
name from already quoted strings, pgloader was doing the wrong thing.
Fix that by having a build-identifier function that may unquote parts then
re-quote properly (if needed) the new identifier.
The code was too complex and the transaction / connection handling wasn't
good enough, too many reconnections when a ROLLBACK; is all we need to be
able to continue our processing.
Also fix some stats counters about errors handled, and improve the error
message by explicitly adding PostgreSQL and the name of the table the error
comes from.
It may happen that PostgreSQL is restarted while pgloader is running, or
that for some other reason we lose the connection to the server, and in most
cases we know how to gracefully reconnect and retry, so just do so.
Fixes #546 (initial report).
The (reduce #'max ...) call requires an initial value to be provided,
because the max function wants at least one argument, as we can see here:
CL-USER> (handler-case (reduce #'max nil) (condition (e) (format t "~a" e)))
Too few arguments in call to #<Compiled-function MAX #x300000113C2F>:
0 arguments provided, at least 1 required.
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches, and the count of items in the reader
queue is now a number of rows, not of batches of them.
Anyway, with this patch in I can't reproduce the following issues:
Fixes #337, fixes #420.
This function is used on every bit of data we send down to PostgreSQL, so I
have good hopes that reducing its memory allocation will have an impact on
loading times, in particular for sizeable data sets.
To properly handle on-error-stop condition, make it a specific pgloader
condition with a specific handling behavior. In passing add some more log
messages for surprising conditions.
Fix #546.
In the prepare-pgsql-database method we were logging too many details, such
as DDL warnings on if-not-exists for successful queries. And those logs are
to be found in the PostgreSQL server logs anyway.
Also fix trying to create or drop a "nil" schema.