This gives a default "null if" option to all the input columns at once, and
it's still possible to override the default per column.
In passing, fix project-fields declarations that SBCL now complains about
when they're not true, such as declaring a vector when we might have :null
or nil. As a result, remove the (declare (optimize speed)) in the generated
field processing code.
Travis spotted a bug that I failed to see; it shows up with Clozure CL (CCL)
but apparently not with SBCL:
2018-07-03T21:04:11.053795Z FATAL The value "\\\"", derived from the initarg :DELIMITER, can not be used to set the value of the slot CL-CSV::DELIMITER in #<CL-CSV::READ-DISPATCH-TABLE-ENTRY #x30200143DDCD>, because it is not of type (VECTOR (OR (MEMBER T NIL) CHARACTER)).
To fix, prefer the syntax #(#\\ #\") rather than "\\\"".
In a previous commit we re-used the package name pgloader.copy for the now
separated implementation of the COPY protocol, but this package was already
in use for the implementation of the COPY file format as a pgloader source.
Oops.
And CCL was happily doing its magic anyway, so I had been blind to the
problem.
To fix, rename the new package pgloader.pgcopy, and to avoid having to deal
with other problems of the same kind in the future, rename every source
package pgloader.source.<format>, so that we now have pgloader.source.copy
and pgloader.pgcopy, two visibly different packages to deal with.
This light refactoring came with a challenge, though. The split between the
pgloader.sources API and the rest of the code involved some circular
dependencies in the namespaces. CL is pretty flexible here because it can
reload code definitions at runtime, but it was still a mess. To untangle it,
implement a new namespace, the pgloader.load package, where we can use the
pgloader.sources API and the pgloader.connection and pgloader.pgsql APIs
too.
A little problem gave birth to quite a massive patch. As it happens when
refactoring and cleaning up the dirt in any large enough project, right?
See #748.
Refactor file organisation further to allow for adding a “direct stream”
option when the on-error-stop behavior has been selected. This happens
currently by default for database sources.
Introduce the new WITH option “on error resume next” which forces the
classic behavior of pgloader. The option “on error stop” already existed;
its implementation is new.
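For illustration only (the file name and table name are placeholders), a load command opting into the new stop-at-first-error behavior could look like this:
    load csv
         from 'data.csv'
         into postgresql:///pgloader
              target table sample
         with fields terminated by ',',
              on error stop;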
When this new behavior is activated, the data is sent to PostgreSQL
directly, without intermediate batches being built. It means that the whole
operation fails at the first error, and we don't have any information in
memory to try replaying any COPY of the data. It's gone.
This behavior should be fine for database migrations as you don't usually
want to fix the data manually in intermediate files, you want to fix the
problem at the source database and do the whole dance all-over again, up
until your casting rules are perfect.
This patch might also incur some performance benefits in terms of both
timing and memory usage, though the local testing didn't show much of
anything for the moment.
The previous patch introduced parser conflicts and we couldn't parse some
expressions any more, such as the following:
fields escaped by '\',
It's now possible to represent a single quote as either '''', '\'', or '0x27'
and we can still parse '\' as being a single backslash character.
See #705.
The option "fields optionally enclosed by" was missing a way to easily
specify a single quote as the quoting character. Add '\'' to the existing
solution '0x27' which isn't as friendly.
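As an illustrative fragment of the WITH clause (the rest of the load command is elided), both spellings now select a single quote as the quoting character:
    fields optionally enclosed by '\'',
    fields optionally enclosed by '0x27',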
See #705.
The previous patch made obvious some regression failures that had been hidden
by strange bugs with CCL.
One such regression was introduced in commit
ab7e77c2d0, where we played with the complex
code generation for field projection; the following two cases weren't
cleanly processed anymore:
column text using "constant"
column text using "field-name"
In the first case we want to load a user-defined constant in the column, in
the second case we want to load the value of the field "field-name" in the
column --- we just have different source and target names.
Another regression was introduced in the recent commit
01e5c23763, where the create-table function was
called too early, before *pgsql-reserved-keywords* had been fetched. As a
consequence, table names weren't always properly quoted, as shown by the
test/csv-header.load file which targets a table named "group".
Finally, skip the test/dbf.load regression test when using CCL as this
environment doesn't have the necessary CP850 code page / encoding.
It used to be that you would give the target table name as an option to the
PostgreSQL connection string, which is distasteful:
load ... into pgsql://user@host/dbname?tablename=foo.bar ...
Or even, for backwards compatibility:
load ... into pgsql://user@host/dbname?foo.bar ...
The new syntax makes provision for a separate clause for the target table
name, possibly schema-qualified:
load ... into pgsql://user@host/dbname target table foo.bar ...
Which is much better, in particular when used together with the target
columns clause.
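As a hedged sketch, with hypothetical table and column names, the two clauses combine like this:
    load csv
         from 'data.csv'
         into pgsql://user@host/dbname
              target table foo.bar
              target columns (id, name)
         with fields terminated by ',';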
Implementing this seemingly quite small feature had an impact on many
parsing-related features of pgloader, such as the regression testing facility.
So much so that some extra refactoring made its way in here, around the
lisp-code-for-loading-from-<source> functions and their usage in
`load-data'.
While at it, this patch simplifies the `load-data' function a lot by making
good use of &allow-other-keys and :allow-other-keys t.
Finally, this patch splits main.lisp into main.lisp and api.lisp, with the
latter intended to contain functions for Common Lisp programs wanting to use
pgloader as a library. The API itself is still the same as before this
patch, though; it just lives in another file for clarity.
This feature has been asked for several times, and I can't see any way to fix
the GETENV parsing mess that we have. In this patch the GETENV support is
retired and replaced with a templating system, using the Mustache syntax.
To get back the GETENV feature, our implementation of the Mustache template
system adds support for fetching the template variable values from the OS
environment.
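A minimal sketch, assuming a DBNAME variable exported in the environment, is to reference it with the Mustache syntax right in the load command:
    load csv
         from 'data.csv'
         into postgresql:///{{DBNAME}}
         with fields terminated by ',';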
Fixes #555, Fixes #609.
See #500, #477, #278.
As we know how many columns we expect from the input file, it's possible to
read a sample (10 lines as of this patch) and try many different CSV reader
parameter combinations until we find one that works: one that returns the right
number of fields.
It is still possible of course to specify parameters on the command line or
in a load file if necessary, but it makes the simple case even simpler. As
simple as:
pgloader file.csv pgsql:///pgloader?tablename=target
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches, and the count of items in the reader
queue is now a number of rows, not of batches of them.
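For reference, a hedged sketch of sizing the reader queue from the WITH clause, assuming the option keeps the spelling used above (the value is arbitrary):
    with prefetch rows = 100000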
Anyway, with this patch in I can't reproduce the following issues:
Fixes #337, Fixes #420.
As shown in #476, it is sometimes needed to be able to quote the
identifier names even when loading from a file, that is when specifying
the target table name in the database uri.
To that end, allow the option "identifier case" to be used in the file
based cases too. Fixes #476.
We used to force overly strict rules for a quoted field name in a CSV
load file; now any character but a quote is accepted as part of the field
name.
Fixes #416.
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.
As this number might be great enough to exhaust the target PostgreSQL
server (e.g. maintenance_work_mem), we add an option to limit that to
something reasonable when the source schema isn't.
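A hedged sketch of that clause, assuming the new option is spelled max parallel create index (the value is arbitrary):
    with max parallel create index = 4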
Fix #386 in which 150 indexes are found on a single source table.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.
To that effect, tweak again the md-connection and md-copy
implementations.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items such
as mapping tables from a schema to another one.
Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function and a
prepare-pgsql-database and a complete-pgsql-database generic function.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn will call the specialized versions of
list-all-columns and friends implementations. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
When devising the common API, the first step has been to implement
specific methods for each generic function of the protocol. It now
appears that in some cases we don't need the extra level of flexibility:
each change of the API has had to be replicated systematically to all the
specific methods, so just use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations, rather
than having to copy/paste and maintain several versions of the same
code.
It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
After all, it's shared between the CSV command parsing and the Cast
Rules parsing. src/parsers/command-csv.lisp still contains lots of
facilities shared between the file based sources and will need another
series of splits.
Commit 598c860cf5 broke user defined
casting rules by interning "precision" and "scale" in the
pgloader.user-symbols package: those symbols need to be found in the
pgloader.transforms package instead.
Luckily enough the infrastructure to do that was already in place for
cl:nil.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that the
output buffer is not subject to race conditions; extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
To be able to use "t" (or "nil") as a column name, pgloader needs to be
able to generate lisp code where those symbols are available. It's
simple enough in that a Common Lisp package that doesn't :use :cl
fulfills the condition, so intern user symbols in a specially crafted
package that doesn't :use :cl.
Now, we still need to be able to run transformation code that is using
the :cl package symbols and the pgloader.transforms functions too. In
this commit we introduce a heuristic to pick symbols either as functions
from pgloader.transforms or anything else in pgloader.user-symbols.
And so that user code may use NIL too, we provide an override mechanism
to the intern-symbol heuristic and use it only when parsing user code,
not when producing Common Lisp code from the parsed load command.
When the notion of a connection class with a generic set of methods was
invented, the very flexible specification formats available for the file
based sources were not integrated into the new connection system.
This patch provides a new connection class md-connection with a specific
sub-protocol (after opening a connection, the caller is supposed to loop
around open-next-stream) so that it's possible to both properly fit into
the connection concept and to better share the code between our three
implementations (csv, copy, fixed).
The dry run option will currently only check database connections, but
as that happens after having correctly parsed the load file, it also allows
checking that the command file is correct for the parser.
Note that the list load-data API isn't subject to the dry-run method.
In passing, we add some more API entry points to the connection objects
and we should actually clean the code base to use the new QUERY generic
all over the place. That's for another patch, though.
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but fails to properly account for that fact in the
summary report with timings.
While fixing this, also fix the SQL used to re-establish the indexes and
associated constraints to allow for parallel execution; the ALTER TABLE
statements would otherwise block in ACCESS EXCLUSIVE MODE and make our
efforts vain.
When loading against a table that already has index definitions, the
load can be quite slow. The previous commit introduced a warning in such a
case. This commit introduces the option "drop indexes" that is not used
by default.
When this option is used, pgloader drops the indexes before loading the
data, then creates the indexes again with the same definitions as before.
All the indexes are created again in parallel to optimize performance.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
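An illustrative fragment, with placeholder file and table names, loading into an already-indexed table:
    load csv
         from 'data.csv'
         into postgresql:///pgloader?tablename=existing
         with truncate, drop indexes;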
Some CSV files are using the CSV escape character internally in their
fields. In that case we hit a parsing bug in cl-csv where backtracking
from parsing the escape string isn't possible (or at least
unimplemented).
To handle the case, change the quote parameter from \" to just \ and let
cl-csv use its escape-quote mechanism to decide if we're escaping only
separators or just any data.
See https://github.com/AccelerationNet/cl-csv/issues/17 where the escape
mode feature was introduced for pgloader issue #80 already.
To allow for importing JSON one-liners as-is in the database it can be
interesting to leverage the CSV parser in a compatible setup. That setup
requires being able to use any separator character as the escape
character.
Some CSV files are given with a header line containing the list of
their column names; use that when given the option "csv header".
Note that when both "skip header" and "csv header" options are used,
pgloader first skips the required number of lines and then uses the next
one as the csv header.
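For example, assuming a file whose first line is to be discarded and whose second line lists the column names, a sketch of the WITH clause would be:
    with skip header = 1,
         csv header,
         fields terminated by ','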
Because of temporary failure to install the `ronn` documentation tool,
this patch only commits the changes to the source docs and omits to
update the man page (pgloader.1). A following patch is intended to be
pushed that fixes that.
See #236 which is using shell tricks to retrieve the field list from the
CSV file itself and motivated this patch to finally get written.
We used to parse qualified table names as a simple string, which then
breaks attempts to be smart about how to quote identifiers. Some sources
are known to accept dots in quoted table names and we need to be able to
process that properly without tripping on qualified table names too
late.
Current code might not be the best approach as it's just using either a
cons or a string for table names internally, rather than defining a
proper data structure with a schema and a name slot.
Well, that's for a later cleanup patch, I happen to be lazy tonight.
This option is dangerous, as it allows skipping ALL triggers when loading
data against PostgreSQL. This includes foreign key constraint
definitions and will allow loading data out of order.
When using both the options "create no table" and "disable triggers" it
will be possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.
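A minimal sketch of such a load, with placeholder connection strings and reusing the option names quoted above:
    load database
         from mysql://user@localhost/sourcedb
         into postgresql:///targetdb
         with create no table,
              disable triggers;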
Fix #167.
In passing also allow --field to specify the whole field list, there's
no point in forcing the user to have as many --field switches on the
command line as they have columns in their data source file.
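For instance, a single --field switch can now carry the whole list; file and target names below are placeholders:
    pgloader --type csv                        \
             --field "id, field"               \
             --with "fields terminated by ','" \
             ./test/data/matching-1.csv        \
             postgres:///pgloader?matching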
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a uniform "connection"
concept (class and generic functions API) everywhere, so that the COPY
derived objects just use that in their :source-db and :target-db slots.
Given that, we no longer need any messing around with *pgconn* and *myconn-*
and other special variables anywhere in the tree.
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when new API got into place.
Third, fix any other oddity or missing part found while doing those
first two activities, it was long overdue anyway...
Make it so that the following command line usages are accepted when
using pgloader without a command file:
./build/bin/pgloader ./test/sqlite/sqlite.db postgresql:///pgloader
./build/bin/pgloader --set "search_path='sakila'" \
mysql://root@localhost/sakila \
postgresql:///sakila
./build/bin/pgloader --type csv \
--field id --field field \
--with truncate \
--with "fields terminated by ','" \
./test/data/matching-1.csv \
postgres:///pgloader?matching
It's now possible in most cases to just use command-line options, which
should make the barrier to entry for pgloader much lower.