pingouin is a python package for
calculating statistics on data organized in pandas dataframes. It has
an easier to use interface than stats-models and a batteries-included
philosophy where operations that maybe take multiple function calls in
scipy.stats are rolled into one call in pingouin. The author calls
it simple-but-exhaustive statistics.
While poking around the capabilities of this package, I discovered the
circular module. I’d never heard of circular statistics before, but
there are a lot of angles in the data I work with every day – and angles
don’t always “play nice” with other types of scalar data.
For example, if we’re talking about compass headings, the difference between
a heading of 45 degrees and a heading of 48 degrees is 2 degrees.
But, the difference between a heading of 358 degrees and 0 degrees is also
2 degrees. You can’t treat angles like other types of scalar data.
Even if you don’t work in physical coordinate systems, any data
with dates and times can be worked in circular statistics.
A year is a 360 degree trip around the sun. A day is a 360 degree
rotation of the earth. Anything that cycles can
be turned into an angle. Tuesday is just an angle away from Sunday.
By converting date-times to radians, you can treat 11:59pm on Monday and
12:02am on Tuesday as the near-identical times they really are, rather than as
two separate calendar days.
I generally think in degrees (it’s easier
to imaging a 15 degree angle in my mind than a .26 radian angle, but
trigonometry runs on radians.
So, the first thing we need to do with any type of circular data analysis is
convert it into radians.
For radar data, np.radians works great. But for other types of data,
pingouin’s covert_angles function
lets you specify the number of “units” in your particular kind of circle.
Once your data is in radians, what sort of stats can you do?
There is no simple analog to linear regression, unfortunately, but
there are correlation functions for circular-to-circular or circular-to-scalar
variables, ways to calculate values analogous to scalar mean and variance,
and checks for a uniform circular distribution.
So, I haven’t actually wanted to be a web developer, but
I’m kind of the data chef, cook, and bottle-washer at work at the
moment, so spinning up some web code it is, then.
The existing code base is written in Bokeh which I hadn’t used
in over a year,
but pyLadies Amsterdam
did a web-training
that I attended to get going again.
This is a mockup of a plot I need for my work website, where
we want to be able to select lines from a plot.
You can use the box select tool in the upper left to select lines, then
click the button on the bottom to change the line color to cyan.
The two sides of the plot are linked (which is the feature I wanted
to play with), so if you box-select
the yellow line on the left, the yellow line on the right
also selects.
The reset tool in the upper left will put the colors back.
In order to use the selection styling functionality built into Bokeh,
I’ve bunched all my lines in each plot into a single MultiLine glyph.
Getting the data into the right format takes up a most of the code
below.
importrandomfrombokeh.plottingimportfigurefrombokeh.modelsimportCustomJS,ColumnDataSourcefrombokeh.eventsimportSelectionGeometry,Resetfrombokeh.layoutsimportcolumn,gridplotfrombokeh.models.widgetsimportButtonfrombokeh.embedimportautoload_staticfrombokeh.resourcesimportCDNfrompalettable.cartocolors.qualitativeimportBold_8importnumpyasnpimportpandasaspd# generate a bunch of lines in "long" format
defmake_line(line_id):line_size=random.choice(range(5,15))line_start=random.choice(range(10))xs=list(range(line_start,line_start+line_size))ys=np.random.sample((line_size))*random.choice(range(1,3)) \
+random.choice(range(3))zs=np.random.sample((line_size))*random.choice(range(1,3)) \
+random.choice(range(3))df=pd.DataFrame(dict(x=xs,y=ys,z=zs,))df['line_id']=line_idreturndfline_df=pd.concat([make_line(i)foriinrange(8)],ignore_index=True).sort_values('x').reset_index(drop=True)# pack the lines into a format that bokeh MultiLine glyph expects
line_group=line_df.groupby('line_id')defper_line_data(group,column):return[group[column].get_group(i)foriingroup.indices]column_data=dict(xs=per_line_data(line_group,'x'),ys=per_line_data(line_group,'y'),zs=per_line_data(line_group,'z'),colors=Bold_8.hex_colors,line_id=list(line_group.indices.keys()))s1=ColumnDataSource(column_data)
Now that I have my data, let’s plot.
Since I want my plots to be linked, I use the same ColumnDataSource for both the
left and right sides. The two plots also have the same x-axis.
common_plot_args=dict(plot_width=400,plot_height=400,tools='box_select,box_zoom,wheel_zoom,pan,save,reset',)xy_plot=figure(title="XY view",**common_plot_args)xy_plot.xaxis.axis_label="The mysterious X"xy_plot.yaxis.axis_label="The variable Y"xz_plot=figure(title="XZ view",**common_plot_args)xz_plot.xaxis.axis_label="The mysterious X"xz_plot.yaxis.axis_label="The parameter Z"common_multiline_args=dict(source=s1,line_color="colors",line_width=3,# This is the cool selection styling :-)
selection_color="black",selection_line_width=3,nonselection_alpha=.8,nonselection_line_width=1,)xy_plot.multi_line(xs="xs",ys="ys",**common_multiline_args)xz_plot.multi_line(xs="xs",ys="zs",**common_multiline_args)
The box select callback finds the data points in the box
boundaries and marks those lines associated with those points as selected.
select_code="""
// box selet callback for both plots
// args:
// s1 = column data source
// xy_names = which column is plotted on x and y axis of current plot
// cb_obj = provided by bokeh showing selected box extent
const x0 = cb_obj['geometry']['x0']
const x1 = cb_obj['geometry']['x1']
const y0 = cb_obj['geometry']['y0']
const y1 = cb_obj['geometry']['y1']
const xs = s1.data[xy_names[0]]
const ys = s1.data[xy_names[1]]
var new_selection = []
// for each line
for (var j=0;j<xs.length;j+=1) {
// grab the points in line j
const xj = xs[j]
const yj = ys[j]
// if one point in the line is in the selection
// box select that line
for (var jj=0;jj<xj.length;jj+=1) {
const xjj = xj[jj]
const yjj = yj[jj]
if ((xjj >= x0) && (xjj <= x1) && (yjj >= y0) && (yjj <= y1)) {
new_selection.push(j)
break
}
// lines are in sorted-by-x order,
// no need to search past end of the box
else if (xjj > x1) {
break
}
}
}
// update s1 with selection
s1.selected['indices'] = new_selection
s1.change.emit()
"""xy_select_callback=CustomJS(args=dict(s1=s1,xy_names=['xs','ys']),code=select_code)xy_plot.js_on_event(SelectionGeometry,xy_select_callback)xz_select_callback=CustomJS(args=dict(s1=s1,xy_names=['xs','zs']),code=select_code)xz_plot.js_on_event(SelectionGeometry,xz_select_callback)
These last two callbacks
Reset the plot to it’s original state
Change the color of the selected plots when you press the button below the plots
reset_callback=CustomJS(args=dict(s1=s1),code=f"""
// reset callback restores original colors and
// clears the selection
// args = s1 = column data source
s1.data['colors'] = {Bold_8.hex_colors}
s1.selected['indices'] = []
s1.change.emit()""")xy_plot.js_on_event(Reset,reset_callback)xz_plot.js_on_event(Reset,reset_callback)b=Button(label="Change the color!")b.js_on_click(CustomJS(args=dict(s1=s1),code="""
// button select callback changes the color of the selected
// lines from whatever they are to 'cyan'
// args = s1 = column data source
const selection = s1.selected['indices']
if (selection.length == 0) {
alert("No line selected")
}
for (var j = 0; j < selection.length; j+= 1) {
s1.data['colors'][selection[j]] = 'cyan'
}
s1.selected['indices'] = []
s1.change.emit()
"""))
The last trick was to get the plot on my blog so I could show it off.
Bokeh’s autoload_static method creates two outputs: the body of a script to
display your plot, and a <script> tag that loads the script. In order
for the script to load properly, you have to give autoload_static the
location where you are going to store your script so that the html tag
part knows what to put for src=.
When I incorporated the script tag into the blog, I wrapped the html part
in a div to set the size (and included a resize corner in case you need it - the plot is
a bit wider than the default display width of the blog). This
blog is in jekyll, so I put the html tag in the _includes/
directory (where jekyll looks when I use the include directive) and the code body in
scripts/ where I told bokeh it would be when I created the html.
(Since the include is only one line, I could’ve just pasted it into the post as well.)