from data import insight

A long documentation page worth reading


This meme floated across data engineering LinkedIn recently, and I have to say I disagree with throwing the docs into the fire here. It’s absolutely true that the reference page for pd.read_csv has an overwhelming number of options – I counted 52 (including the deprecated ones), but all those options mean that this function can handle just about any file you want to throw at it!

If you’re not going to memorize all of those options, here are some of the ones I use most frequently:

  • usecols

    One easy way to speed up reading a csv file is to not process the columns you don’t need, rather than reading the whole file and dropping columns later. You can also designate one of the columns as the index with index_col (which, despite the singular name, can also take a list if you want a multicolumn index).

  • dtype

    Setting the datatypes of certain columns can save on memory or clarify whether a column represents a floating point, integer, or boolean value.

  • converters
  • on_bad_lines (especially setting it to a callback function)

    Writing your own per-column data converters and/or adding an on_bad_lines callback means you can load a file and do data cleaning or unit conversions in one step. The argument to your converter lambda function will be the raw string contents of the column, while the on_bad_lines callback will get a list of all the column values.
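
To make that concrete, here is a sketch of how those options might combine in a single call. The file name, column names, and conversions are all invented for the example:

    import pandas as pd

    # Hypothetical sensors.csv with columns: timestamp, sensor_id, temp_f, status, notes
    df = pd.read_csv(
        "sensors.csv",
        usecols=["timestamp", "sensor_id", "temp_f", "status"],  # never parse 'notes'
        index_col="timestamp",
        dtype={"sensor_id": "int32", "status": "category"},
        # each converter receives the raw string contents of its column
        converters={"temp_f": lambda s: (float(s) - 32) * 5 / 9},
        # the callback receives the list of fields from a malformed line;
        # return a fixed-up list (here: the first 5 fields) or None to skip it
        # (a callable on_bad_lines requires the python engine)
        on_bad_lines=lambda fields: fields[:5],
        engine="python",
    )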

Many of the other options on that long documentation page are actually ways of doing simple conversions without having to type up your own lambda function: true_values and false_values convert values to booleans, and na_values marks values that should be treated as nulls/NaNs. The arguments parse_dates, infer_datetime_format, date_parser, dayfirst and cache_dates control automatically converting columns to dates/times, and thousands and decimal control how strings are converted to numbers.
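
Again as a sketch with made-up column names, a European-style export could then be loaded with no lambdas in sight:

    import pandas as pd

    # Hypothetical orders.csv with Y/N flags and "1.234.567,89"-style numbers
    df = pd.read_csv(
        "orders.csv",
        parse_dates=["order_date"],   # convert this column to datetime64
        dayfirst=True,                # dates written as 31/12/2024
        true_values=["Y", "yes"],     # map these strings to True...
        false_values=["N", "no"],     # ...and these to False
        na_values=["", "n/a", "-"],   # extra markers to treat as NaN
        thousands=".",                # period as the thousands separator
        decimal=",",                  # comma as the decimal separator
    )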

  • nrows
  • chunksize

    These options are useful for large files, either to get a quick sniff of the data or to process it in chunks. nrows will read just the first N rows, while chunksize will turn the output of read_csv into an iterator that yields the data in chunks:

    for chunk in pd.read_csv('my_file.csv', chunksize=10_000):
        process_next_chunk(chunk)
    

A powerful function for such a common operation is worth spending the time to understand in depth. If the reference page is too terse, there are more details and examples in the pandas I/O guide.

Tuning a Postgres BRIN index

I’m working on a new PostgreSQL database at work, and it’s just getting big enough that getting the indexes right makes a noticeable difference. I just scored a big “win” by adding a properly thought-out BRIN index to a particular table, but it took me a bit to get there.

What I didn’t know about BRIN could fill a blog post

The first half of this post is introductory and covers the obvious and better-documented aspects of the BRIN index. The second half covers the subtler aspects I needed to figure out in order to use the index effectively for my use case.

BRIN indexes – the easy bits

Lots of blog posts and tutorials explain clearly what a BRIN index is and how to create and use one. If you know the basics of BRIN indexes, you can skip to the next part of this post. If you aren’t familiar with this index type and my brief explanations aren’t enough detail, you can google it and find plenty of basic explainers. Of course, the PostgreSQL docs are always a good source of information as well.

What is a BRIN index?

A BRIN index is a Block Range INdex. It is a lightweight way to index data in large tables that is “banded” in physical storage. For example, if your data is typically added chronologically and has a date column, the data layout on the table will be banded by date. Given the date on one row, the date on the next row is almost certainly the same date or the next day’s date.

It works by creating a data structure that stores, for each “block range,” the minimum and maximum value of the indexed column for that block of rows. When searching for the rows that match the indexed column, the query can go block-by-block as a first pass, rather than row-by-row. Only the matching blocks need to be checked row-by-row.
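
As a minimal sketch (the table and column names here are hypothetical), creating and using one looks like this:

-- A table whose rows arrive in roughly chronological order
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at);

-- Range queries on the banded column can now skip whole block ranges
SELECT count(*)
FROM events
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-02-01';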

By default, BRIN indexes need to be maintained explicitly

The default behavior of a BRIN index is to NOT keep its summaries up to date as the table grows. This makes table inserts go faster, but it means the DBA must call the appropriate index maintenance functions, or the index must be created with autosummarize = on.

Unsummarized block ranges always have to be searched row-by-row. The data that exists when you create the index does get summarized, but if you create a BRIN index with the default settings and never summarize it again, the index will have no effect on queries over anything inserted after that.
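
A sketch of both options, reusing the hypothetical index from above:

-- Option 1: summarize newly filled block ranges on demand, e.g. from a
-- scheduled job (a plain VACUUM on the table also summarizes them)
SELECT brin_summarize_new_values('events_created_at_brin');

-- Option 2: create the index with autosummarize in the first place,
-- so autovacuum keeps the summaries current
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at)
    WITH (autosummarize = on);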

BRIN indexes – the subtle bits

How many rows are in a block?

Or, why the default block size is almost certainly wrong for your particular table

The benefits of searching block-by-block are modulated by the number of rows in a block: Are you going by 10’s or 1000’s or 100,000’s?

You want the blocks to be big enough that searching block-by-block creates a noticeable benefit, and small enough that when you hit the case where you have to search a block row-by-row, you’re not walking through a million rows to only end up using three of them.

The index block size is specified in database pages. You will need to know the patterns in your table and your typical queries in order to pick a good number of rows to put in your blocks, and then you need to do the storage math to determine how many database pages that number of rows uses.

This is the query I came up with to estimate the storage used by one row of my table:

/*
 * Estimate the amount of storage used by one row
 * of table 'my_table'
 */
SELECT
    24 + -- base overhead for any heap row
    SUM(
        CASE
            WHEN attlen > 0
            THEN attlen -- fixed storage size of the column's type
            ELSE 18     -- attlen = -1: variable-length column that can be TOASTed
        END
    ) AS estimated_row_bytes
FROM
         pg_tables pt
    JOIN pg_class pc
        ON pt.tablename = pc.relname
    JOIN pg_attribute pa
        ON pa.attrelid = pc.oid
WHERE
    pt.tablename = 'my_table'
    AND pa.attnum > 0        -- only user columns, not system columns
    AND NOT pa.attisdropped  -- skip dropped columns
;

Once you decide on your best estimate for the row size of your table, and your best estimate for the number of rows to put in a block, you need to figure out how many pages that works out to and set the pages_per_range storage parameter. There are 8192 bytes in a page (by default), so 8192 divided by the row size gives you the rows per page. The desired rows per block divided by the rows per page gives you the desired number of pages per block.

The default number of pages for a BRIN index block is 128. For a narrow table, this may be too many. For a wide table, it may be too few. If you don’t do the math, you’ll never know.
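
As a worked example with assumed numbers (not from a real table): say the row-size query above comes back with roughly 160 bytes per row and you want about 10,000 rows per block range. Then 8192 / 160 ≈ 51 rows per page, and 10,000 / 51 ≈ 196 pages per range:

-- Illustrative only: pages_per_range derived from the assumed numbers above
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at)
    WITH (pages_per_range = 196, autosummarize = on);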

Go ahead and throw in more columns!

On my umpteenth trip through the Postgres docs trying to get the best speed-up with my new index, I came across this:

A multicolumn BRIN index can be used with query conditions that involve any subset of the index’s columns. Like GIN and unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use. The only reason to have multiple BRIN indexes instead of one multicolumn BRIN index on a single table is to have a different pages_per_range storage parameter. source

That’s very different from the other indexes I use frequently: B-tree indexes can be multicolumn, but because the tree is built in a particular column order, it’s hard to make a B-tree that is effective for more than one or two columns. Hash indexes are restricted to a single column.

With BRIN indexes, adding additional columns to the index costs very little! You don’t have to worry about column order, and Postgres will construct the multicolumn index efficiently so there is no reason to create two single-column indexes.

Once you’ve identified a primary column that shows banded behavior, and you’ve decided to put in a BRIN index, it’s OK to add additional correlated columns without worrying too much, even if they’re more weakly correlated than the primary column that caused you to create the BRIN index in the first place.

For example, if you have an orders table that is updated chronologically as orders come in, there may be a correlation between the time of day and whether orders are arriving from North America, Europe, or Asia. The region_id column may not be as cleanly banded as the order_time column, but if you query by region frequently, it may have enough of a correlation to be worth throwing in as a second column to see if it adds value. It’s a low-risk experiment to try!
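
Sticking with that hypothetical orders table, the experiment is a single statement:

-- One multicolumn BRIN index; column order doesn't matter for BRIN
CREATE INDEX orders_time_region_brin
    ON orders USING brin (order_time, region_id);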


Postscript: Thank you to Postgres Weekly Issue 274 and 5 Minutes of Postgres Episode 38 for featuring this post!