From Python to Postgres in 2023
Table of Contents
- Defining the problem
- The baseline
- Executemany
- Copy implementations
- A new option for data staging in 2023 - DuckDB
- Conclusions
Back in 2019, Postgres developer/blogger Haki Benita wrote a long post about the fastest way to load data into Postgresql using python and psycopg2. With four years of improvements to psycopg2, and a newer library psycopg (“psycopg3” is installed/imported as psycopg, with no number), I decided to revisit some of his tests and see if the fastest data load method has changed. I also added a new DuckDB-based test case to see how much overhead one of my new favorite toys adds to data cleaning and loading.
Defining the problem
The ground rules
I’m reimplementing the code from the 2019 post in both psycopg2 and psycopg. There are other choices out there these days, but if you’ve been using psycopg2, psycopg is your likely upgrade path since it will require fewer changes than coding to a brand new API.
When I mention the performance of an option in 2019, I copied that number from the original blog post: I did not try to install four-year-old versions of python and psycopg2. So, there are some apples-and-oranges comparisons here. My laptop is an older (2014) MacBook Pro, so I’m not pitting 2023 top-of-the-line hardware against 2019 hardware. However, I’m using Postgres 15, so I’ve upgraded my database software as well as my python version and libraries.
The data
I’m only going to touch on the dataset briefly; for more details, please follow the links to the original post.
There is a list of 32,500 beers, generated by querying an API over the network for 325 beers and then making a total of 100 copies of the list. This list is then stored in memory, which takes network latency out of the performance comparisons.
The data loaders can make a copy of this list (increasing their memory use), but if they iterate one item at a time, the memory used by the full list does not “count”. This is meant to simulate streaming the data vs downloading and processing it in bulk.
The destination table will only take 17 keys from the input data and two of those keys (first_brewed and volume) need some minor data cleaning.
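For reference, the cleanup for those two keys boils down to something like the following; a minimal sketch, assuming first_brewed arrives as an "MM/YYYY" (or "YYYY") string and volume as a mapping with value and unit keys. The helper names here are my own shorthand, not necessarily the original post’s.

```python
import datetime


def parse_first_brewed(text: str) -> datetime.date:
    # "09/2007" -> date(2007, 9, 1); "2007" -> date(2007, 1, 1)
    parts = text.split("/")
    if len(parts) == 2:
        return datetime.date(int(parts[1]), int(parts[0]), 1)
    if len(parts) == 1:
        return datetime.date(int(parts[0]), 1, 1)
    raise ValueError(f"Unknown date format: {text}")


def parse_volume(volume: dict) -> int:
    # The staging table only keeps the numeric value, not the unit.
    return volume["value"]
```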
The destination table
I’m using the same destination table definition as in the original blog post, but the “cursor” in the create_staging_table function could be from psycopg2 or psycopg.
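For completeness, here is a trimmed-down sketch of what create_staging_table does; the column list is abbreviated, so see the original post for the full 17-column definition.

```python
def create_staging_table(cursor) -> None:
    # UNLOGGED skips the write-ahead log, which is fine for disposable staging data.
    cursor.execute("""
        DROP TABLE IF EXISTS staging_beers;
        CREATE UNLOGGED TABLE staging_beers (
            id              INTEGER,
            name            TEXT,
            tagline         TEXT,
            first_brewed    DATE,
            -- ...the remaining columns from the original post...
            volume          INTEGER
        );
    """)
```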
The profiler
I’m also stealing the time and memory profiler from the original post. The profiler wrapper runs each implementation twice: once to profile time, and again to profile memory use.
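The wrapper is essentially a decorator along these lines; a sketch, assuming memory_profiler is installed and that memory is reported as the delta (in MiB) between the peak and the starting footprint:

```python
import time
from functools import wraps

from memory_profiler import memory_usage


def profile(fn):
    @wraps(fn)
    def inner(*args, **kwargs):
        fn_kwargs_str = ", ".join(f"{k}={v}" for k, v in kwargs.items())
        print(f"\n{fn.__name__}({fn_kwargs_str})")

        # First run: wall-clock time.
        t = time.perf_counter()
        retval = fn(*args, **kwargs)
        print(f"Time   {time.perf_counter() - t:0.4}")

        # Second run: memory delta while the function executes.
        mem, retval = memory_usage((fn, args, kwargs), retval=True, interval=1e-7)
        print(f"Memory {max(mem) - min(mem)}")
        return retval

    return inner
```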
One thing I did notice during my tests is that the memory usage was much more variable between different runs of the same test case, especially if I didn’t restart my notebook between trials.
The baseline
The baseline test case is to insert rows one by one. In 2019, this took two minutes to accomplish with psycopg2. Using the 2023 version of psycopg2, things got much better, and the new psycopg is better still.
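The one-by-one loader is just a loop of single-row INSERTs; a sketch with the column list abbreviated, reusing the parse_first_brewed helper from above (the same code runs under psycopg2 and psycopg):

```python
@profile
def insert_one_by_one(connection, beers) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        for beer in beers:
            # One round trip per row: simple, but the slowest possible approach.
            cursor.execute("""
                INSERT INTO staging_beers (id, name, tagline, first_brewed, volume)
                VALUES (%(id)s, %(name)s, %(tagline)s, %(first_brewed)s, %(volume)s);
            """, {
                "id": beer["id"],
                "name": beer["name"],
                "tagline": beer["tagline"],
                "first_brewed": parse_first_brewed(beer["first_brewed"]),
                "volume": beer["volume"]["value"],
            })
```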
| Library | Function | Time (s) | Memory (MB) |
| --- | --- | --- | --- |
| psycopg2 - 2019 | insert_one_by_one() | 128.8 | 0.08203125 |
| psycopg2 - 2023 | insert_one_by_one() | 4.568 | 0.125 |
| psycopg - 2023 | insert_one_by_one() | 3.683 | 0.0078125 |
Executemany
The next test case in the original blog post used cursor.executemany to send many rows in one function call. One flavor marshals the entire list up front; the other uses a generator to send one row at a time and reduce the memory footprint. The insert_executemany and insert_executemany_iterator functions can run the same code for psycopg2 and psycopg. Psycopg2-2023 beats psycopg2-2019 again, and the psycopg implementation improves on that further still, this time by almost 4x!
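The iterator flavor looks roughly like this; a sketch with an abbreviated column list, identical for both libraries:

```python
@profile
def insert_executemany_iterator(connection, beers) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        # Passing a generator keeps only one cleaned row in memory at a time.
        cursor.executemany("""
            INSERT INTO staging_beers (id, name, tagline, first_brewed, volume)
            VALUES (%(id)s, %(name)s, %(tagline)s, %(first_brewed)s, %(volume)s);
        """, ({
            "id": beer["id"],
            "name": beer["name"],
            "tagline": beer["tagline"],
            "first_brewed": parse_first_brewed(beer["first_brewed"]),
            "volume": beer["volume"]["value"],
        } for beer in beers))
```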
| Library | Function | Time (s) | Memory (MB) |
| --- | --- | --- | --- |
| psycopg2 - 2019 | insert_executemany() | 124.7 | 2.765625 |
| psycopg2 - 2019 | insert_executemany_iterator() | 129.3 | 0.0 |
| psycopg2 - 2023 | insert_executemany() | 4.279 | 16.70703125 |
| psycopg2 - 2023 | insert_executemany_iterator() | 4.399 | 0.0 |
| psycopg - 2023 | insert_executemany() | 1.24 | 3.9453125 |
| psycopg - 2023 | insert_executemany_iterator() | 1.231 | 0.0 |
Copy implementations
The original blog post also experimented with psycopg2’s execute_values and execute_batch functions, including trying different batch sizes. These approaches do not have a direct analog in psycopg, and there doesn’t seem to be a need for them given the massive improvement in executemany. However, if you are porting code from psycopg2, be aware that these functions no longer exist.
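As a minimal before/after sketch of that conversion (rows here is just a hypothetical list of tuples, and the column list is abbreviated):

```python
rows = [(1, "Beer One"), (2, "Beer Two")]  # hypothetical sample rows


def insert_with_psycopg2(cursor) -> None:
    # psycopg2: execute_values expands a single VALUES clause with many rows.
    from psycopg2.extras import execute_values
    execute_values(cursor, "INSERT INTO staging_beers (id, name) VALUES %s", rows)


def insert_with_psycopg(cursor) -> None:
    # psycopg (3): execute_values/execute_batch are gone; executemany fills the same role.
    cursor.executemany("INSERT INTO staging_beers (id, name) VALUES (%s, %s)", rows)
```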
I’m going to skip ahead to the fastest implementations, the ones based on the Postgresql COPY command.
Copying with psycopg2
The fastest approach in 2019 was to use psycopg2’s copy functions. There are two implementations: one uses a StringIO object to generate a csv file in memory, which is then copied to the table as a single blob. The other implementation uses a custom subclass of io.TextIOBase to avoid holding the entire set of rows in memory, improving speed and memory use at the expense of additional code complexity.
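The StringIO flavor, run against 2023 psycopg2, looks roughly like this; a sketch with the column list abbreviated, where clean_csv_value handles the NULL and newline escaping that COPY’s text format expects:

```python
import io


def clean_csv_value(value) -> str:
    # COPY's text format uses \N for NULL; escape embedded newlines.
    if value is None:
        return r"\N"
    return str(value).replace("\n", "\\n")


@profile
def copy_stringio(connection, beers) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        # Build the whole "file" in memory first...
        csv_file_like_object = io.StringIO()
        for beer in beers:
            csv_file_like_object.write("|".join(map(clean_csv_value, (
                beer["id"],
                beer["name"],
                parse_first_brewed(beer["first_brewed"]),
                beer["volume"]["value"],
            ))) + "\n")
        csv_file_like_object.seek(0)
        # ...then hand it to COPY in one shot.
        cursor.copy_from(csv_file_like_object, "staging_beers", sep="|",
                         columns=("id", "name", "first_brewed", "volume"))
```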
Copying with psycopg
I could not use the same code for psycopg: psycopg2’s cursor.copy_from method is replaced by a new cursor.copy method, which returns an object to which you write the data to be copied. This means you don’t really need to create a separate file-like object to accomplish the copy unless you need one for some other reason.
In order to force the rewritten copy_stringio method to send the entire file at once (to compare apples to apples), I did go ahead and create a StringIO object. However, for copy_string_iterator I just wrote directly to psycopg’s copy object. This simplifies the code from the 2019 implementation and still saves time and memory.
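Here is roughly what the rewritten copy_string_iterator looks like with psycopg; a sketch reusing clean_csv_value from above and an abbreviated column list:

```python
@profile
def copy_string_iterator(connection, beers) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        with cursor.copy(
            "COPY staging_beers (id, name, first_brewed, volume) "
            "FROM STDIN WITH (DELIMITER '|')"
        ) as copy:
            for beer in beers:
                # Write pre-formatted text rows straight into the copy object;
                # no intermediate file-like object needed.
                copy.write("|".join(map(clean_csv_value, (
                    beer["id"],
                    beer["name"],
                    parse_first_brewed(beer["first_brewed"]),
                    beer["volume"]["value"],
                ))) + "\n")
```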
But even the string_iterator method is not using the new copy object to its fullest potential. The clean_csv_value function that handles NULLs and string-ifies column values is no longer necessary in psycopg: the copy object can take a tuple of all the data for a row and handle any Nones for you. Furthermore, if you tell psycopg the data types of the destination columns, the library will use binary transfers under the covers where possible, saving the overhead of converting column values to strings.
I call this new flavor copy_tuple_iterator:
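A sketch of it, again with the column list abbreviated:

```python
@profile
def copy_tuple_iterator(connection, beers) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        with cursor.copy(
            "COPY staging_beers (id, name, first_brewed, volume) FROM STDIN"
        ) as copy:
            # Declaring the destination types lets psycopg pick the right adapters.
            copy.set_types(["integer", "text", "date", "integer"])
            for beer in beers:
                # write_row handles None values and conversion; no string munging here.
                copy.write_row((
                    beer["id"],
                    beer["name"],
                    parse_first_brewed(beer["first_brewed"]),
                    beer["volume"]["value"],
                ))
```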
When doing things the string-munging way, psycopg doesn’t offer a big gain over psycopg2. However, there is a definite performance improvement when using the “tuple” method. Plus, the tuple method leaves the string-hacking stuff to the library, which means my application code is simpler and easier to maintain.
| Library | Function | Time (s) | Memory (MB) |
| --- | --- | --- | --- |
| psycopg2 - 2019 | copy_stringio() | 0.6274 | 99.109375 |
| psycopg2 - 2019 | copy_string_iterator(size=1024) | 0.4596 | 0.0 |
| psycopg2 - 2023 | copy_stringio() | 0.4525 | 9.765625 |
| psycopg2 - 2023 | copy_string_iterator() | 0.3442 | 0.0 |
| psycopg - 2023 | copy_stringio() | 0.49 | 16.1 |
| psycopg - 2023 | copy_string_iterator() | 0.3413 | 0.01 |
| psycopg - 2023 | copy_tuple_iterator() | 0.2397 | 0.07 |
A new option for data staging in 2023 - DuckDB
But what if we don’t want to stage inside of Postgres at all? DuckDB is becoming a favorite swiss-army-knife for combining data from flat files, in-memory dataframes, and, via the Postgres scanner, data already in a database. I can do a lot of data cleaning by bringing my query engine to my data rather than bringing my data to my query engine.
I decided to try just bulk copying data from json or parquet flat files to the same tables as the above tests.
First, let’s dump our data into some flat files:
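Something along these lines, assuming a reasonably recent DuckDB with the JSON extension available and the same in-memory beers list from the earlier tests:

```python
import json

import duckdb

# Dump the in-memory list of dicts to JSON...
with open("beers.json", "w") as f:
    json.dump(beers, f)

# ...then let DuckDB read the JSON file and rewrite it as Parquet.
duckdb.sql(
    "COPY (SELECT * FROM read_json_auto('beers.json')) "
    "TO 'beers.parquet' (FORMAT PARQUET)"
)
```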
Next, the code to load the data is the same regardless of which flat file we are reading from. I used the copy method that was the fastest in the previous experiment.
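A sketch of copy_with_duckdb, trimmed to a few simple columns; a fuller version would also push the first_brewed/volume cleanup into the DuckDB SQL:

```python
import duckdb


@profile
def copy_with_duckdb(connection, whichfile: str) -> None:
    duck = duckdb.connect()
    # Selecting straight from the file path (JSON or Parquet) had the least overhead.
    duck.execute(f"SELECT id, name, tagline FROM '{whichfile}'")
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        with cursor.copy(
            "COPY staging_beers (id, name, tagline) FROM STDIN"
        ) as copy:
            copy.set_types(["integer", "text", "text"])
            # Stream batches out of DuckDB and rows into Postgres.
            while batch := duck.fetchmany(1024):
                for row in batch:
                    copy.write_row(row)
```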
Psycopg + DuckDB results
Considering that I’m adding additional file I/O and another data query engine to the workflow, it’s not surprising that it takes longer to do things this way. Working with the binary parquet file format is faster than working with the text json file. The copy from json implementation needs to deserialize the json data, which was not included in the timings of the other test cases where I passed in a list of pre-deserialized dictionaries.
I’m also not counting the work to write out the staged data to a file when setting up the test case, which would increase overhead as well. Considering that we took out network latency overhead from the other test cases, this seems fair.
By reusing the tables from the previous tests, I’m loading into an unlogged table. When using duckdb as a final data cleaning and transformation tool, you might be loading into a logged table, with index updates and other overhead.
Which is a long way of saying that it’s not clear how “good” these numbers are: this isn’t a real-world example of what I actually use this tool for day-to-day. But for me, one of the appeals of DuckDB is how much easier it is to express certain types of data munging versus my other options. As long as the additional overhead is reasonable, I’m always a fan of things that are simple and expressive, and the results don’t seem unreasonable to me at this point.
| Function | Time (s) | Memory (MB) |
| --- | --- | --- |
| copy_with_duckdb(whichfile=beers.json) | 0.77 | 41.5 |
| copy_with_duckdb(whichfile=beers.parquet) | 0.3496 | 8.96 |
This blog post is getting super-long, but I will note one thing I discovered while playing with the DuckDB option: what I select from makes quite a difference. I ended up selecting directly from ‘whichfile’, which had the least overhead, versus first creating a DuckDB view (a little worse) or a DuckDB table (noticeably worse) from my file and selecting from that instead.
Conclusions
- You should move from psycopg2 to psycopg for doing bulk inserts
- Use of execute and executemany does not change much, but calls to execute_values and execute_batch in psycopg2 will need to be converted to executemany in psycopg.
- The new psycopg copy object may work with minor changes versus the psycopg2 code, but the best performance will come from using the new copy object to its fullest potential. In addition, the improved copy API means your code may be simpler and easier to write/maintain.
- If you are staging your data with DuckDB, I’d love to hear about methods you’ve been using to profile the overhead or any performance tips and tricks! There is clearly more than one blog post to be written on this topic still…
Finally, if you’d like to play with this yourself, my source code is here.