This meme floated across data engineering LinkedIn recently, and I have to say I disagree with
throwing the docs into the fire here. It’s absolutely true that the reference page for pd.read_csv has an overwhelming number of options – I counted 52 (including the deprecated ones) – but all those options mean that this function can handle just about any file you want to throw at it!
If you’re not going to memorize all of those options, here are some of the ones I use most frequently:
usecols
One easy way to speed up reading a CSV file is to skip the columns you don’t need rather than reading the whole file and dropping them later. You can also designate one of the columns as the index with index_col (which, despite the singular name, can also take a list if you want a multicolumn index).
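For instance, a minimal sketch (the file and column names here are made up for illustration):

import pandas as pd

# Read only the columns we need, and use "order_id" as the index
# (index_col must be one of the columns listed in usecols).
df = pd.read_csv(
    "orders.csv",
    usecols=["order_id", "order_time", "region_id", "total"],
    index_col="order_id",
)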
dtype
Setting the datatypes of certain columns can save on memory or clarify whether a column represents a floating point,
integer, or boolean value.
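A sketch of what that might look like, again with invented column names; note that the non-nullable int32 assumes the column has no missing values:

import pandas as pd

df = pd.read_csv(
    "orders.csv",
    dtype={
        "region_id": "int32",       # smaller integer type saves memory (assumes no NaNs)
        "total": "float64",
        "is_gift": "boolean",       # pandas nullable boolean dtype
        "coupon_code": "category",  # low-cardinality strings store compactly as categories
    },
)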
converters
on_bad_lines (especially setting it to a callback function)
Writing your own per-column data converters and/or adding an on_bad_lines callback means you can load a file and do data cleaning or unit conversions in one step. The argument to your converter function is the raw string value from that column for each row, while the on_bad_lines callback gets a list of all the field values in the offending row.
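As a rough sketch (the column names and the unit conversion are invented, and the callable form of on_bad_lines is supported with the Python engine):

import pandas as pd

def pounds_to_kg(raw):
    # raw is the string contents of the "weight" cell for one row
    return float(raw) * 0.453592

def log_bad_line(bad_line):
    # bad_line is a list of all the field values in the offending row;
    # returning None tells pandas to drop the row
    print("skipping bad line:", bad_line)
    return None

df = pd.read_csv(
    "shipments.csv",
    converters={"weight": pounds_to_kg},
    on_bad_lines=log_bad_line,
    engine="python",
)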
Many of the other options on that long documentation page are actually ways of doing simple conversions without having to type up your own lambda function: true_values and false_values convert values to booleans, and na_values marks values that are nulls/nans. Arguments parse_dates, infer_datetime_format, date_parser, dayfirst and cache_dates control automatically converting columns to dates/times, and thousands and decimal control how strings are converted to numbers.
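A quick sketch combining a few of these (the column names are hypothetical):

import pandas as pd

df = pd.read_csv(
    "orders.csv",
    true_values=["Y", "yes"],
    false_values=["N", "no"],
    na_values=["", "N/A", "missing"],
    parse_dates=["order_time"],
    dayfirst=False,      # ambiguous dates like "01/02/2024" are read month-first
    thousands=",",       # "1,234,567" -> 1234567
    decimal=".",
)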
nrows
chunksize
These options are useful for large files, either to get a quick sniff of the data or to process it in chunks. nrows will read just a subset of the rows, while chunksize will turn the output of read_csv into an iterator you can process chunk by chunk:
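Something along these lines (the file name and chunk size are placeholders):

import pandas as pd

# Peek at the first 1,000 rows to check column names and inferred dtypes
preview = pd.read_csv("big_file.csv", nrows=1000)
print(preview.dtypes)

# Then process the whole file in manageable pieces
row_count = 0
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    # each chunk is an ordinary DataFrame; do your per-chunk work here
    row_count += len(chunk)
print(row_count)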
A powerful function for such a common operation is worth spending the time to understand in depth. If the
reference page is too terse,
there are more details and examples in the I/O guide
here.
I’m working on a new PostgreSQL database at work, and it’s just
getting big enough that getting the indexes right
makes a noticeable difference.
I just scored a big “win” by adding a properly
thought-out BRIN index
to a particular table, but it took me a bit to get there.
What I didn’t know about BRIN could fill a blog post
The first half of this post is introductory
and covers the obvious and better-documented aspects
of the BRIN index. The second half covers the subtler
aspects I needed to figure out in order to use the index
effectively for my use case.
BRIN indexes – the easy bits
Lots of blog posts/tutorials explain clearly what a BRIN index
is and how to create and use one. If you know the basics about a BRIN
index, you can
skip to the next part of this post. If you aren’t familiar with
this index type, and my brief explanations aren’t
enough detail, you can google it and find plenty of basic explainers.
Of course, the PostgreSQL docs are always a good source of information as well.
What is a BRIN index?
A BRIN index is a Block Range INdex.
It is a lightweight way to index large tables whose data is “banded” in
physical storage. For example, if your data is typically added chronologically
and has a date column, the data layout in the table will be banded by date.
Given the date on one row, the date on the next row is almost certainly the
same date or the next day’s date.
It works by creating a data structure that stores, for each “block range,”
the minimum and maximum value of the indexed column for that block of rows.
When searching for the rows that match the indexed column, the query
can go block-by-block as a first pass, rather than row-by-row.
Only the matching blocks need to be checked row-by-row.
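For example, creating one on a hypothetical orders table that is banded by order_time is a one-liner:

-- Hypothetical table: index the column the physical storage is banded by
CREATE INDEX orders_order_time_brin
    ON orders USING brin (order_time);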
By default, BRIN indexes need to be maintained explicitly
The default behavior of a BRIN index is to NOT keep the index
up to date with the table. This makes table inserts go faster, but
means that the DBA must call the appropriate
index maintenance functions
or the index must be created with autosummarize = on.
Unindexed blocks have to be searched row-by-row.
If you create a BRIN index with the default settings and never
summarize it, the index will have no effect on your queries.
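Concretely, with the hypothetical index from above, that means either calling the summarization function yourself (say, from a cron job) or creating the index with autosummarize turned on:

-- Option 1: summarize any new, unsummarized block ranges by hand
SELECT brin_summarize_new_values('orders_order_time_brin');

-- Option 2: let autovacuum summarize new ranges as they fill up
CREATE INDEX orders_order_time_brin
    ON orders USING brin (order_time)
    WITH (autosummarize = on);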
BRIN indexes – the subtle bits
How many rows are in a block?
Or, why the default block size is almost certainly wrong for your particular table
The benefits of searching block-by-block are modulated by the
number of rows in a block: Are you going by 10’s or 1000’s or 100,000’s?
You want the blocks to be big enough that searching
block-by-block creates a noticeable benefit, and small enough that when you
hit the case where you have to search a block row-by-row, you’re not walking through
a million rows to only end up using three of them.
The index block size is specified in database pages.
You will need to know the patterns in your table and your typical queries in
order to pick a good number of rows to put in your blocks, and then you
need to do the
storage math to determine how many database pages that number of rows uses.
This is the query I came up with to determine a good value to use
as my estimate for storing one row of my table:
/*
* Estimate the amount of storage used by one row
* of table 'my_table'
 */
SELECT 24 +  -- base overhead for any row
       SUM(CASE
             WHEN attlen > 0 THEN attlen  -- storage size of column type when not NULL
             -- but attlen == -1 means the column can be TOASTed
             ELSE 18                      -- overhead for a TOASTed column
           END)
FROM pg_tables pt
JOIN pg_class pc ON pt.tablename = pc.relname
JOIN pg_attribute pa ON pa.attrelid = pc.oid
WHERE pt.tablename = 'my_table';
Once you decide on your best estimate for the row size of your table,
and your best estimate for the number of rows to put in a block, you
need to figure out how many pages to put in a block and set the
pages_per_range storage parameter. There are 8192 bytes in a page,
so 8192 divided by row size gives you the rows per page. The
desired rows per block divided by the rows per page gives you the
desired number of pages per block.
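As a worked example with made-up numbers: if a row takes roughly 80 bytes, a page holds about 8192 / 80 ≈ 100 rows, so targeting roughly 50,000 rows per block works out to about 500 pages per block:

-- Hypothetical sizing: ~100 rows/page, target ~50,000 rows/block
CREATE INDEX orders_order_time_brin
    ON orders USING brin (order_time)
    WITH (pages_per_range = 500);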
The default number of pages for a BRIN index block is 128. For a narrow
table, this may be too many. For a wide table, it may be too few.
If you don’t do the math, you’ll never know.
Go ahead and throw in more columns!
On my umpteenth trip through the Postgres docs trying to get the best
speed-up with my new index, I came across this:
A multicolumn BRIN index can be used with query conditions that involve any subset of the index’s columns. Like GIN and unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use. The only reason to have multiple BRIN indexes instead of one multicolumn BRIN index on a single table is to have a different pages_per_range storage parameter. source
That’s very different from the other indexes I use frequently: B-tree indexes
can be multicolumn, but because the tree is built in a particular
column order, it’s hard to make a B-tree that is effective
for more than one or two columns. Hash indexes are restricted to a single column.
With BRIN indexes, adding additional columns to the index costs very little!
You don’t have to worry about column order, and Postgres will construct the
multicolumn index efficiently so there is no reason to create two single-column
indexes.
Once you’ve identified a primary column that shows
banded behavior, and you’ve decided to put in a BRIN index, it’s OK
to add additional correlated columns without worrying too much. They may be a little
more weakly correlated than the primary column that caused you to create
the BRIN index in the first place.
For example, if you have an orders table that is updated chronologically
as orders come in, there may be a correlation between the time of day and
which region the orders come from, since daylight hours in North America,
Europe, and Asia don’t overlap.
That correlation may not be as perfectly banded as the order_time column, but
if you query by region frequently, the region_id of the order may be
correlated enough that it’s worth throwing in as a second column to see if it
adds value.
It’s a low-risk experiment to try!
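Under the hypothetical orders schema used above, that experiment is a single statement:

-- Add region_id to the same BRIN index; column order doesn't matter,
-- and pages_per_range is shared by all columns in the index
CREATE INDEX orders_brin
    ON orders USING brin (order_time, region_id)
    WITH (pages_per_range = 500);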