Prior to this patch, pgloader didn't care which schema it created extra
types in. Extra types are mainly the ENUM and SET support from MySQL. Now
pgloader creates those extra PostgreSQL ENUM types in the same schema as
the table using them, which is a sounder default.
This change was long overdue. Ideally we would use something like
Clojure's YeSQL library, but it seems the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...
So this patch introduces a URL abstraction built on top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as
(sql "pgsql/list-all-columns.sql")
in the source code directly.
So for now the templating system is CL's format language. It is still an
improvement over embedded strings. Again, one step at a time.
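A minimal sketch of the idea, using hypothetical names rather than
pgloader's actual internals: a hash table keyed by the URL-like path, a
registration function that reads each .sql file at build time, and a
lookup function that hands the query text to CL's format.

    (defvar *sql-queries* (make-hash-table :test 'equal)
      "Map from URL-like keys to SQL query text.")

    (defun register-sql-file (url pathname)
      "Read PATHNAME and store its contents under URL,
       e.g. \"pgsql/list-all-columns.sql\"."
      (setf (gethash url *sql-queries*) (uiop:read-file-string pathname)))

    (defun sql (url &rest args)
      "Return the query registered under URL, filled in with CL:FORMAT,
       which is the templating system for now."
      (apply #'format nil (gethash url *sql-queries*) args))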
When pgloader fetches the index list from a source database, it doesn't
fetch information about access methods for the indexes: I don't even know if
the overlap between index access methods from one RDBMS to another covers
more than just btree...
It could happen, though, that MySQL indexes a "geometry" column. This datatype is
converted automatically to "point" by pgloader, which is good. But the index
creation would fail with the following error message:
Database error 42704: data type point has no default operator class for access method "btree"
In this patch when setting up the target schema we issue a PostgreSQL
catalog query to dynamically list those datatypes without btree support and
fetch their opclasses, with a hard-coded preference for GiST, then GIN, so
as to be able to automatically use the proper access method when btree isn't
available. And now pgloader transparently issues the proper statement:
CREATE INDEX idx_168468_idx_location ON pagila.address USING gist(location);
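The catalog query looks something like the following sketch (not
pgloader's exact query): for every data type lacking a default btree
operator class, pick a default operator class from another access
method, preferring gist, then gin.

    (defparameter *sql-non-btree-opclasses* "
    select t.typname,
           (select am.amname
              from pg_opclass oc
              join pg_am am on am.oid = oc.opcmethod
             where oc.opcintype = t.oid and oc.opcdefault
             order by case am.amname when 'gist' then 0
                                     when 'gin'  then 1
                                     else 2 end
             limit 1) as amname
      from pg_type t
     where not exists (select 1
                         from pg_opclass oc
                         join pg_am am on am.oid = oc.opcmethod
                        where am.amname = 'btree'
                          and oc.opcintype = t.oid
                          and oc.opcdefault)"
      "Sketch of a catalog query listing, for each type without a default
       btree opclass, a preferred alternative access method.")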
Currently this exploration is limited to indexes with a single column. To
implement the general case we would need a more complex lookup: we would
have to find the intersection of all the supported access methods for all
involved columns.
Of course we might need to do that someday. One step at a time is plenty
good enough for now, though.
In cases where pgloader needs to build a new identifier from existing
ones (mainly for renaming indexes, because they are unique per-table in the
source database and unique per-schema in PostgreSQL), and we compose the new
name from already quoted strings, pgloader was doing the wrong thing.
Fix that with a build-identifier function that unquotes the parts and
then properly re-quotes (if needed) the new identifier.
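A minimal sketch of that behaviour, simplified compared to pgloader's
actual quoting machinery (it assumes double-quote quoting and a fixed
"_" separator):

    (defun unquote (name)
      "Strip the surrounding double-quotes from NAME, when present."
      (if (and (< 1 (length name))
               (char= #\" (char name 0))
               (char= #\" (char name (1- (length name)))))
          (subseq name 1 (1- (length name)))
          name))

    (defun build-identifier (&rest parts)
      "Join PARTS into a new identifier, unquoting each part first and
       re-quoting the result when any part was quoted."
      (let ((quoted (some (lambda (p) (and (plusp (length p))
                                           (char= #\" (char p 0))))
                          parts))
            (name   (format nil "~{~a~^_~}" (mapcar #'unquote parts))))
        (if quoted (format nil "\"~a\"" name) name)))

    ;; (build-identifier "idx" "\"MyTable\"" "\"Location\"")
    ;; => "\"idx_MyTable_Location\""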
The (reduce #'max ...) call requires an initial value to be provided,
because the max function wants at least one argument, as we can see here:
CL-USER> (handler-case (reduce #'max nil) (condition (e) (format t "~a" e)))
Too few arguments in call to #<Compiled-function MAX #x300000113C2F>:
0 arguments provided, at least 1 required.
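The fix is to provide the initial value explicitly:

    CL-USER> (reduce #'max nil :initial-value 0)
    0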
The "main" function only gets used at the command line, and errors were
not cleanly reported to users, mainly because I almost never get to play
with pgloader that way, preferring a load command file and the REPL
environment. But that's not even acceptable as an excuse.
Now the binary program should be able to exit cleanly in all situations. In
testing, it may happen in unexpected error situations that we quit
before printing all the messages in the monitoring queue, but at least
now we quit cleanly and with a non-zero exit status.
Fix#583.
Experiment with the idea of splitting the read work across several
concurrent threads, where each reader reads a portion of the target
table, using a WHERE id <= x AND id > y clause in its SELECT query.
For this to kick in, a number of conditions need to be met, as described
in the documentation. The main interest might not be faster queries to
fetch the same overall data set, but better concurrency, with as many
readers as writers and each couple having its own dedicated queue.
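A minimal sketch of how such a split could translate into WHERE clauses,
with hypothetical names and numbers (the actual conditions and syntax
live in the pgloader sources and documentation):

    (defun partition-where-clauses (column min-id max-id readers)
      "Split [MIN-ID, MAX-ID] into READERS ranges and return the WHERE
       clause each concurrent reader would use in its SELECT query."
      (loop :with step := (ceiling (- max-id min-id) readers)
            :for lo := (1- min-id) :then hi
            :for hi := (min max-id (+ lo step))
            :repeat readers
            :collect (format nil "~a > ~d and ~a <= ~d" column lo column hi)))

    ;; (partition-where-clauses "id" 1 1000000 4)
    ;; => ("id > 0 and id <= 250000"      "id > 250000 and id <= 500000"
    ;;     "id > 500000 and id <= 750000" "id > 750000 and id <= 1000000")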
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do
the data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader,
which seems more stable than before, and easier to cope with for SBCL's
GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches, and the count of items in the
reader queue is now a number of rows, not of batches of them.
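A minimal sketch of the new hand-off, assuming the lparallel.queue and
bordeaux-threads APIs; read-row and send-batch are hypothetical
callbacks standing in for the source driver and the COPY machinery:

    (defvar *prefetch-rows* 25000
      "How many raw rows the reader may push ahead of the writer
       (arbitrary value for this sketch).")

    (defun copy-from-queue (read-row send-batch &key (batch-size 5000))
      "READ-ROW returns the next source row or nil; SEND-BATCH ships a
       list of rows to PostgreSQL in COPY format."
      (let ((queue (lparallel.queue:make-queue :fixed-capacity *prefetch-rows*)))
        ;; reader thread: push raw rows, then a sentinel
        (bt:make-thread
         (lambda ()
           (loop :for row := (funcall read-row)
                 :while row
                 :do (lparallel.queue:push-queue row queue)
                 :finally (lparallel.queue:push-queue :eof queue)))
         :name "reader")
        ;; writer (here, the current thread): cook batches and send them
        (loop :with batch := nil :with count := 0
              :for row := (lparallel.queue:pop-queue queue)
              :until (eq row :eof)
              :do (push row batch)
                  (when (= (incf count) batch-size)
                    (funcall send-batch (nreverse batch))
                    (setf batch nil count 0))
              :finally (when batch (funcall send-batch (nreverse batch))))))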
Anyway, with this patch in place I can't reproduce the following issues:
Fixes#337, Fixes#420.
In case of an exceptional condition leading to termination of the pgloader
program we tried to use log-message after the monitor should have been
closed. Also, the 0.3s delay to let the last messages out looks like
poor design.
This patch attempts to remedy both problems: refrain from using a
closed down monitoring thread, and properly wait until it's done before
returning to the shell.
See #583.
pgloader has had support for PostgreSQL SET parameters (GUCs) from the
beginning, and in the same vein it might be necessary to tweak MySQL
connection parameters and allow pgloader users to control them.
See #337 and #420 where net_read_timeout and net_write_timeout might need to
be set in order to be able to complete the migration, due to high volumes of
data being processed.
In this patch we hard-code some cases where we know the log message won't
be displayed anywhere, so as to avoid sending it to the monitor thread. It
certainly is a modularity violation, but given the performance impact...
The code used to take the content-length HTTP header into account to
load that number of bytes in memory from the remote server. Not only is
it better to use a fixed-size, allocated-once buffer for that (now 4k),
but doing so also allows downloading content whose content-length you
don't know.
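A minimal sketch of the buffering loop, assuming a binary input stream
handed over by the HTTP library (the function name is illustrative, not
pgloader's):

    (defun save-http-response (stream pathname &key (buffer-size 4096))
      "Copy STREAM to PATHNAME 4k at a time; works whether or not the
       server sent a content-length header."
      (with-open-file (out pathname :direction :output
                                    :element-type '(unsigned-byte 8)
                                    :if-exists :supersede)
        (loop :with buffer := (make-array buffer-size
                                          :element-type '(unsigned-byte 8))
              :for bytes := (read-sequence buffer stream)
              :while (plusp bytes)
              :do (write-sequence buffer out :end bytes))))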
In passing tell the HTTP-URI parser rule that we also accept https:// as a
prefix, not just http://.
This allows running pgloader in such cases:
$ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook
And it just works!
In order to support custom SQL files with several queries and psql-like
advanced features such as \i, we have our own internal SQL parser in
pgloader. The PostgreSQL variant of SQL is pretty complex and allows
dollar-quoting and other subtleties that we need to take care of.
Here we fix the case where a dollar sign ($) appears inside single-quoted
text (such as a regexp): it is then part of the quoted text being read,
not the start of a new dollar-quoting tag.
In passing we also fix the reading of double-quoted identifiers, even
when they contain a dollar sign. After all, the following is fully
supported by PostgreSQL:
create table dollar("$id" serial, foo text);
select "$id", foo from dollar;
Fix#561.
The concurrent nature of pgloader made it non-obvious where to implement
the timers properly, and as a result the tracking of how long it took to
actually transfer the data was... just wrong.
Rather than trying to measure the time spent in any particular piece of the
code, we now emit "start" and "stop" stats messages to the monitor thread at
the right places (which are way easier to find, in the worker threads) and
have the monitor figure out how long it really took.
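A minimal sketch of the message flow, with hypothetical message shapes
and function names (send-to-monitor and copy-table-data are
placeholders):

    (defun timed-copy-table (table)
      "Worker side: only report events, never compute durations here."
      (send-to-monitor (list :start table (get-internal-real-time)))
      (unwind-protect
           (copy-table-data table)
        (send-to-monitor (list :stop table (get-internal-real-time)))))

    (defun monitor-handle-timing (message start-times)
      "Monitor side: START-TIMES maps a table to its :start timestamp;
       on :stop, return the elapsed time in seconds."
      (destructuring-bind (kind table time) message
        (ecase kind
          (:start (setf (gethash table start-times) time))
          (:stop  (/ (- time (gethash table start-times))
                     internal-time-units-per-second)))))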
Fix#506.
Now that we have fixed the output of the per-table total timing, we can
show only that timing by default. With more verbosity pgloader will add
the extra columns, and in computer-oriented formats (json, csv, copy) all
the details are always provided, of course.
See #506.
We have to pay attention to the fact that column names in MS SQL don't
follow the same rules as in PostgreSQL and may, e.g., begin with numbers.
Apply the identifier case and naming rules to index column names too.
In PostgreSQL it is possible at CREATE TABLE time to set some extra storage
parameters, the most useful of them in the context of pgloader being the
FILLFACTOR. For the setting to be useful, it needs to be set at
CREATE TABLE time, before we load the data.
The BEFORE LOAD clause of the pgloader command allows running SQL
scripts that will be executed before the load, and even before the
creation of the target schema when pgloader does that, which is nice for
other use cases.
Here we implement a new `ALTER TABLE` rule that one can set in the pgloader
command in order to change storage parameters at CREATE TABLE time:
ALTER TABLE NAMES MATCHING ~/\./ SET (fillfactor='40')
Fix#516.
Now that we have a proper flush system for reporting the summary at the
proper time (see 7c5396f0975be405910d66f4b5aedc89acd75c1d), refrain from
also taking care of the reporting when stopping the monitor.
Adapt the regression driver code to flush the summary after loading the
expected data, which also provides better output.
Previously, when the summary output was sent to a file, stopping the
monitor would also create a backup file and replace our summary with an
empty new file...
Fixes#499.
This sits between NOTICE and INFO, allowing for a complete log of
the SQL queries sent to the server while avoiding the very verbose
traffic of the DEBUG log level.
See #498.
Make it so that fatal errors are printed only once, and when possible
included in the usual log format as handled by our monitoring thread.
Also, improve error and summary reporting when we load from several
sources on the same command line.
All this work was triggered by an edge case where the OS return value
of the pgloader command was 0 (zero, success) even though the file given
on the command line did not exist.
Fixes#486.
We added some confusion about who's responsible for quoting the SQL
object names between src/utils/quoting.lisp and src/pgsql/pgsql-ddl.lisp,
and as a result some migrations from MySQL with identifier case set to
quote were broken, as in #439.
To fix, remove any use of the format directive ~s in the PostgreSQL DDL
output methods: quoting is a decision that belongs to
apply-identifier-case. We then use ~a instead of ~s.
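The problem is easy to see at the REPL: once apply-identifier-case has
already returned a quoted name, ~s quotes it a second time, while ~a
leaves it alone:

    CL-USER> (format t "DROP TABLE ~s;~%DROP TABLE ~a;~%" "\"MyTable\"" "\"MyTable\"")
    DROP TABLE "\"MyTable\"";
    DROP TABLE "MyTable";
    NIL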
Fix#439.
In the MySQL source we have explicit support for both string equality
and regexps in the INCLUDING and EXCLUDING clauses. This got broken when
the code was moved to be shared with the ALTER TABLE implementation,
because we were no longer using the type system in the same way in all
places.
To fix, create new abstractions for strings and regexps and use those
new structs in the proper way (thanks to defstruct and CLOS).
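A minimal sketch of that abstraction, with hypothetical struct and
function names (pgloader's real ones differ), assuming cl-ppcre for the
regexp side:

    (defstruct string-match target)
    (defstruct regex-match  pattern)

    (defun filter-matches-p (filter name)
      "Dispatch on the struct type rather than guessing from a raw value."
      (etypecase filter
        (string-match (string-equal (string-match-target filter) name))
        (regex-match  (and (cl-ppcre:scan (regex-match-pattern filter) name) t))))

    ;; (filter-matches-p (make-string-match :target "film") "film")       => T
    ;; (filter-matches-p (make-regex-match :pattern "^film") "film_actor") => T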
Fixes#441.
In cases where we have a WITH include drop option, we are generating
lots of SQL DROP statements. We may be running against an empty target
database, or in other situations where the target object of the DROP
command might not exist. Add support for that case.
The internal catalog representation is deeply recursive in order to
make it easy to traverse the catalog both downwards (catalog to schema
to tables) and upwards (table to its schema to its catalog).
As a consequence we need to set *print-circle* to non-nil when we're
going to log the catalogs, so turn it on before generating the log
messages.
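Without it, printing a circular structure would loop forever; with it,
the printer falls back to the #n= / #n# reader syntax, which stays both
readable and re-readable:

    CL-USER> (let ((list (list 1 2 3)))
               (setf (cdr (last list)) list)    ; make the list circular
               (let ((*print-circle* t))
                 (format nil "~a" list)))
    "#1=(1 2 3 . #1#)"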
While at it, add logging of such catalogs in the :data log verbosity
mode. The catalog output is very verbose, but it's easy to copy/paste it
from a bug report into a live object we can inspect in the REPL, thanks
to Common Lisp's notion of a reader and readable printing!
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better performance of the data loading. Some of the indexes
are UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on
them in the PostgreSQL dependency tracking of the catalog.
We used to use the CASCADE option when dropping the indexes, which was
hiding a bug: if tables with foreign keys pointing to the tables we
target are excluded from the load, then we would DROP those foreign keys
because of the CASCADE option, but fail to install them again at the end
of the load.
To prevent that from happening, pgloader now queries the PostgreSQL
pg_depend system catalog to list the “missing” foreign keys and adds them
to our internal catalog representation, from which we know to DROP then
CREATE the SQL objects at the proper times.
See #400 as this was an oversight in fixing this issue.
First, add indexes and foreign keys to the list of objects supported by
the shared catalog facility, where they were previously only found in
the pgsql schema-specific package, for historical reasons.
Then also add to our catalog internal structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.
Now that we have a proper and complete catalog, review the pgsql module
DDL output function in terms of the catalog and rewrite the schema
creation support so that it takes direct benefit from our internal
catalog representation.
In passing, clean up the code organisation of the pgsql target support
module to be easier to work with.
The next step consists of getting rid of src/pgsql/queries.lisp: this
facility should be replaced by the usage of a target catalog that we
fetch the usual way, thanks to the new src/pgsql/pgsql-schema.lisp file
and list-all-* functions.
That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by other tools than pgloader,
that is when migrating with the help of an ORM. See #400 for details.
It's possible that in MySQL a foreign key constraint definition points
to a non-existing table. In such a case, issue an error message and
refrain from later trying to reinstall the faulty foreign key
definition.
The lack of error handling at this point apparently led to a frozen
instance of pgloader, I think because it could not display the
interactive debugger at the point where the error occurred.
See #328, also #337 that might be fixed here.
The max function requires at least 1 argument to be given, and in the
case where we have no table to load it then fails badly, as shown here:
CL-USER> (handler-case
             (reduce #'max nil)
           (condition (c)
             (format nil "~a" c)))
"invalid number of arguments: 0"
Of course Common Lisp comes with a very easy way around that problem:
CL-USER> (reduce #'max nil :initial-value 0)
0
Fix#381.
This was broken by a recent commit that forced the internal table
representation to always be an instance of the table structure, which
wasn't yet true for regression testing.
In passing, re-indent a large portion of the function, which accounts
for most of the diff.
The function needs to return a string to be added to the COPY stream,
and we still need to make sure that whatever is given here looks like an
integer. Given the very dynamic nature of data types in SQLite, the
integer-to-string function was already a default, but somehow its fixed
version had failed to be published before.
The newid() function seems to be equivalent to the newsequentialid() one
if I'm to believe issue #204, so let's just add that assumption in the
code.
Fix#204.
Once more we can't use an aggregate over a text column in MS SQL to
build the index definition from its catalog structure, so we have to do
that in the lisp part of the code.
Multi-column indexes are now supported, but filtered indexes still are a
problem: the WHERE clause in MS SQL is not compatible with the
PostgreSQL syntax (because of [names] and type casting).
For example we cast MS SQL bit to PostgreSQL boolean, so
WHERE ([deleted]=(0))
should be translated to
WHERE not deleted
And the code to do that is not included yet.
The following documentation page offers more examples of WHERE
expressions we might want to support:
https://technet.microsoft.com/en-us/library/cc280372.aspx
WHERE EndDate IS NOT NULL
    AND ComponentID = 5
    AND StartDate > '01/01/2008'

WHERE EndDate IN ('20000825', '20000908', '20000918')
It might be worth automating the translation to PostgreSQL syntax and
operators, but it's not done in this patch.
See #365, where the created index will now be as follows, which is a
problem because it is UNIQUE: some existing data won't reload fine.
CREATE UNIQUE INDEX idx_<oid>_foo_name_unique ON dbo.foo (name, type, deleted);
Have a pretty-print option where we try to be nice to the reader, and
don't use it in the CAST debug messages. Also allow working with the
real maximum length of column names rather than hard-coding 22 cols...
Having been given a test instance of an MS SQL database made it possible
to quickly fix a series of assorted bugs related to schema handling of
MS SQL databases. As it's the only source with a proper notion of schema
that pgloader currently supports, it's no surprise we had them.
Fix#343. Fix#349. Fix#354.
The new ALTER TABLE facility allows acting on tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes#224.
In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
We target CURRENT_TIMESTAMP as the PostgreSQL default value for columns
even when it was different before, on the grounds that the type casting
in PostgreSQL does the job, as in the following example:
pgloader# create table test_ts(ts timestamptz(6) not null default CURRENT_TIMESTAMP);
CREATE TABLE
pgloader# insert into test_ts VALUES(DEFAULT);
INSERT 0 1
pgloader# table test_ts;
              ts
-------------------------------
 2016-02-24 18:32:22.820477+01
(1 row)
pgloader# drop table test_ts;
DROP TABLE
pgloader# create table test_ts(ts timestamptz(0) not null default CURRENT_TIMESTAMP);
CREATE TABLE
pgloader# insert into test_ts VALUES(DEFAULT);
INSERT 0 1
pgloader# table test_ts;
           ts
------------------------
 2016-02-24 18:32:44+01
(1 row)
Fix#341.
It turns out that the MySQL catalog always stores default values as
strings, even when the column itself is of type bytea. In some cases,
it's then impossible to transform the expected bytea from a string.
In passing, move some code around to fix dependencies and make it
possible to issue log warnings from the default value printing code.
The decision to use lots of different packages in pgloader has quite
strong downsides at times, and the manual management of dependencies is
one of them, in particular how to avoid circular ones.
On the theory that it's a better service to the user to refuse to do
anything at all rather than ignore his/her commands, print out FATAL
errors when options are used that are incompatible with a load command
file.
See #327 for a case where this did happen.
In passing, tweak our report code to avoid printing the footer when we
didn't print anything at all previously.
In issue #328 the --debug level output is not helpful because of an
encoding error in the logfile. Let's see about forcing the log file
external format to utf-8 then.
Thanks to a reproducible test case we can see that MySQL's default for a
varbinary column is an empty string, so tweak the transform function
byte-vector-to-bytea in order to cope with that.
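A minimal sketch of the kind of tweak involved, not pgloader's actual
transform: render the value in PostgreSQL's hex format for bytea and
accept MySQL's empty-string default instead of erroring on it.

    (defun byte-vector-to-bytea (value)
      "Render VALUE as a PostgreSQL hex-format bytea value; an empty
       string (MySQL's default for varbinary columns) gives an empty bytea."
      (when value
        (let ((bytes (if (stringp value)
                         (map 'vector #'char-code value)
                         value)))
          (with-output-to-string (s)
            (write-string "\\x" s)
            (loop :for byte :across bytes
                  :do (format s "~2,'0x" byte))))))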