from data import insight :

Selecting lines in a Bokeh plot

So, I never actually wanted to be a web developer, but I’m kind of the data chef, cook, and bottle-washer at work at the moment, so spinning up some web code it is, then.

The existing code base is written in Bokeh, which I hadn’t used in over a year, but PyLadies Amsterdam ran an online training that I attended to get going again.

This is a mockup of a plot I need for my work website, where we want to be able to select lines from a plot. You can use the box select tool in the upper left to select lines, then click the button on the bottom to change the line color to cyan. The two sides of the plot are linked (which is the feature I wanted to play with), so if you box-select the yellow line on the left, the yellow line on the right also selects. The reset tool in the upper left will put the colors back.

In order to use the selection styling functionality built into Bokeh, I’ve bunched all my lines in each plot into a single MultiLine glyph. Getting the data into the right format takes up most of the code below.

import random

from bokeh.plotting import figure
from bokeh.models import CustomJS, ColumnDataSource
from bokeh.events import SelectionGeometry, Reset
from bokeh.layouts import column, gridplot
from bokeh.models.widgets import Button
from bokeh.embed import autoload_static
from bokeh.resources import CDN

from palettable.cartocolors.qualitative import Bold_8
import numpy as np
import pandas as pd

# generate a bunch of lines in "long" format
def make_line(line_id):
    line_size = random.choice(range(5,15))
    line_start = random.choice(range(10))
    xs = list(range(line_start, line_start + line_size))
    ys = np.random.sample((line_size)) * random.choice(range(1,3)) \
                + random.choice(range(3))
    zs = np.random.sample((line_size)) * random.choice(range(1,3)) \
                + random.choice(range(3))
    
    df = pd.DataFrame(
        dict(
            x=xs,
            y=ys,
            z=zs,
        )
    )
    df['line_id'] = line_id
    return df
    
line_df = pd.concat(
    [
        make_line(i) for i in range(8)
    ],
    ignore_index = True
).sort_values('x').reset_index(drop=True)

# pack the lines into a format that bokeh MultiLine glyph expects
line_group = line_df.groupby('line_id')
def per_line_data(group, column):
    # one plain list per line, in line_id order
    return [group[column].get_group(i).tolist() for i in group.indices]
    
column_data = dict(
    xs=per_line_data(line_group, 'x'),
    ys=per_line_data(line_group, 'y'),
    zs=per_line_data(line_group, 'z'),
    colors=Bold_8.hex_colors, 
    line_id=list(line_group.indices.keys())
)

s1 = ColumnDataSource(column_data)
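
To make the target shape concrete, here is a minimal sketch of the same long-to-nested packing in plain Python, without pandas or Bokeh. The helper name `pack_lines` and the toy rows are made up for illustration. The point is only that every key ends up holding one list per line, which is the list-of-lists layout the multi_line glyph reads.

```python
from collections import defaultdict

def pack_lines(rows):
    """Group (line_id, x, y) rows into one list of xs and one list of ys per line.

    Hypothetical helper for illustration; rows are assumed to be sorted by x.
    """
    xs_by_line = defaultdict(list)
    ys_by_line = defaultdict(list)
    for line_id, x, y in rows:
        xs_by_line[line_id].append(x)
        ys_by_line[line_id].append(y)
    line_ids = sorted(xs_by_line)
    return dict(
        xs=[xs_by_line[i] for i in line_ids],
        ys=[ys_by_line[i] for i in line_ids],
        line_id=line_ids,
    )

# toy "long" data: two lines, interleaved because the rows are sorted by x
rows = [(0, 1, 0.5), (1, 2, 1.5), (0, 2, 0.7), (1, 3, 1.1), (0, 3, 0.2)]
packed = pack_lines(rows)
print(packed["xs"])
print(packed["ys"])
```

Entry j of every column describes line j, so a per-line column such as the colors has to be in the same order as the line ids.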

Now that I have my data, let’s plot. Since I want my plots to be linked, I use the same ColumnDataSource for both the left and right sides. The two plots also have the same x-axis.

common_plot_args = dict(
    plot_width=400, 
    plot_height=400, 
    tools = 'box_select,box_zoom,wheel_zoom,pan,save,reset',
)

xy_plot = figure(title="XY view", **common_plot_args)
xy_plot.xaxis.axis_label = "The mysterious X"
xy_plot.yaxis.axis_label = "The variable Y"

xz_plot = figure(title="XZ view", **common_plot_args)
xz_plot.xaxis.axis_label = "The mysterious X"
xz_plot.yaxis.axis_label = "The parameter Z"

common_multiline_args = dict(
    source=s1, 
    line_color="colors", 
    line_width=3,
    # This is the cool selection styling :-)
    selection_color="black",
    selection_line_width=3,
    nonselection_alpha=.8,
    nonselection_line_width=1,
)

xy_plot.multi_line(xs="xs", ys="ys", **common_multiline_args)
xz_plot.multi_line(xs="xs", ys="zs", **common_multiline_args)

The box select callback finds the data points inside the box boundaries and marks the lines those points belong to as selected.

select_code ="""
// box select callback for both plots
// args:
// s1 = column data source
// xy_names = which column is plotted on x and y axis of current plot
// cb_obj = provided by bokeh showing selected box extent

const x0 = cb_obj['geometry']['x0']
const x1 = cb_obj['geometry']['x1']
const y0 = cb_obj['geometry']['y0']
const y1 = cb_obj['geometry']['y1']
const xs = s1.data[xy_names[0]]
const ys = s1.data[xy_names[1]]

var new_selection = []

// for each line
for (var j=0;j<xs.length;j+=1) {
    
    // grab the points in line j
    const xj = xs[j]
    const yj = ys[j]
    
    // if one point in the line is in the selection
    // box select that line
    
    for (var jj=0;jj<xj.length;jj+=1) {
        const xjj = xj[jj]
        const yjj = yj[jj]
    
        if ((xjj >= x0) && (xjj <= x1) && (yjj >= y0) && (yjj <= y1)) {
            new_selection.push(j)
            break 
        }
        
        // lines are in sorted-by-x order, 
        // no need to search past end of the box
        else if (xjj > x1) {
            break
        }
    }
}

// update s1 with selection
s1.selected['indices'] = new_selection
s1.change.emit()
"""

xy_select_callback =  CustomJS(
                        args=dict(s1=s1, xy_names=['xs','ys']), 
                        code=select_code)
xy_plot.js_on_event(SelectionGeometry, xy_select_callback)

xz_select_callback =  CustomJS(
                        args=dict(s1=s1, xy_names=['xs','zs']),
                        code=select_code)
xz_plot.js_on_event(SelectionGeometry, xz_select_callback)
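
If you want to check the selection logic without opening a browser, the same loop can be sketched in Python. This is a rough port for testing, not part of the plot code; the function name `lines_in_box` and the sample data are invented here.

```python
def lines_in_box(xs, ys, x0, x1, y0, y1):
    """Return indices of lines with at least one point inside the box.

    Mirrors the JavaScript loop; assumes each line's points are sorted by x.
    """
    selected = []
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        for x, y in zip(xj, yj):
            if x0 <= x <= x1 and y0 <= y <= y1:
                selected.append(j)
                break
            if x > x1:
                # sorted by x, so no later point can fall inside the box
                break
    return selected

# three toy lines; only the first has a point inside the box
xs = [[0, 1, 2], [1, 2, 3], [5, 6]]
ys = [[0.0, 1.0, 2.0], [3.0, 3.5, 4.0], [1.0, 1.0]]
print(lines_in_box(xs, ys, x0=0.5, x1=2.5, y0=0.5, y1=2.5))
```

One thing the port makes easy to see: only the data points are tested, so a box that crosses a segment between two points without containing either point will not select that line.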

These last two callbacks:

  • Reset the plot to its original state
  • Change the color of the selected lines when you press the button below the plots

reset_callback = CustomJS(args=dict(s1=s1), 
code=f"""
// reset callback restores original colors and 
// clears the selection
// args = s1 = column data source
s1.data['colors'] = {Bold_8.hex_colors}
s1.selected['indices'] = []
s1.change.emit()"""
    )

xy_plot.js_on_event(Reset, reset_callback)
xz_plot.js_on_event(Reset, reset_callback)

b = Button(label="Change the color!")
b.js_on_click(CustomJS(args=dict(s1=s1), code="""
// button select callback changes the color of the selected
// lines from whatever they are to 'cyan'
// args = s1 = column data source
    const selection = s1.selected['indices']

    if (selection.length == 0) {
        alert("No line selected")
    }
    for (var j = 0; j < selection.length; j+= 1) {
        s1.data['colors'][selection[j]] = 'cyan'
    }
    s1.selected['indices'] = []
    s1.change.emit()
"""))
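
A side note on the reset callback: pasting the color list into the JavaScript with an f-string works because Python's repr of a list of plain strings happens to also be a valid JavaScript array literal. A slightly more defensive sketch, assuming nothing beyond the standard library, is to serialize with `json.dumps`, which also does the right thing for values like `True` or `None`. The three colors below are placeholders; the post uses `Bold_8.hex_colors`.

```python
import json

# placeholder palette for illustration; the post uses Bold_8.hex_colors
colors = ["#7F3C8D", "#11A579", "#3969AC"]

# json.dumps produces a literal that JavaScript can parse as an array
reset_code = f"""
s1.data['colors'] = {json.dumps(colors)}
s1.selected['indices'] = []
s1.change.emit()
"""
print(reset_code)
```

Another option would be to pass the original colors in through the `args` dict of `CustomJS`, the same way `xy_names` is passed to the select callback.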

The last trick was to get the plot on my blog so I could show it off. Bokeh’s autoload_static function creates two outputs: the body of a script to display your plot, and a <script> tag that loads the script. In order for the script to load properly, you have to give autoload_static the location where you are going to store your script so that the HTML tag part knows what to put for src=.

bothviews = gridplot([[xy_plot, xz_plot]], sizing_mode='scale_both')
plot_with_button = column(bothviews, b)

script_body, html_tag = autoload_static(plot_with_button, CDN, "/scripts/2020_11_23/two_plots.js")

with open("two_plots.html", 'w') as fp:
    fp.write(html_tag)

with open("two_plots.js",'w') as fp:
    fp.write(script_body)
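
The two files have to land in different places, and the script has to end up at the path that was handed to autoload_static. Here is a small sketch of that bookkeeping. The helper `write_embed_files` is hypothetical, the strings stand in for the real autoload_static outputs, and the directory layout is an assumption based on Jekyll defaults (`_includes/` for includes, ordinary folders copied to the site as-is).

```python
import tempfile
from pathlib import Path

def write_embed_files(site_root, script_url, include_name, script_body, html_tag):
    """Write the JS body where script_url says it lives, and the tag into _includes/.

    Hypothetical helper; the layout assumes a default Jekyll site.
    """
    site_root = Path(site_root)
    js_path = site_root / script_url.lstrip("/")
    tag_path = site_root / "_includes" / include_name
    for path, text in ((js_path, script_body), (tag_path, html_tag)):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
    return js_path, tag_path

# try it in a throwaway directory, with placeholder file contents
with tempfile.TemporaryDirectory() as root:
    js_path, tag_path = write_embed_files(
        root,
        "/scripts/2020_11_23/two_plots.js",
        "2020_11_23/two_plots.html",
        "// script body placeholder",
        '<script src="/scripts/2020_11_23/two_plots.js"></script>',
    )
    print(js_path.relative_to(root))
    print(tag_path.relative_to(root))
```

Keeping the URL in one variable and deriving the file path from it means the src= in the tag and the actual file location cannot drift apart.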

When I incorporated the script tag into the blog, I wrapped the HTML part in a div to set the size (and included a resize corner in case you need it - the plot is a bit wider than the default display width of the blog). This blog is built with Jekyll, so I put the HTML tag in the _includes/ directory (where Jekyll looks when I use the include directive) and the code body in scripts/, where I told Bokeh it would be when I created the HTML.

(Since the include is only one line, I could’ve just pasted it into the post as well.)

<div style="width: 100%; height: 450px; resize:both; overflow:auto">

{% include 2020_11_23/two_plots.html  %}

</div>

Post Bootcamp Perspective

It’s been a while since I started this blog as part of my data science bootcamp. Looking back at my little blog, it seemed like I should either take it down or start using it again. Since it’s NaNoWriMo, let’s see if I can maybe get it going again?

Since this blog was born in a bootcamp, I thought a good ‘welcome back’ post might be how I look back on my bootcamp experience.

Things you should know before doing a data science bootcamp

  1. Think of a bootcamp as “the icing on the cake.” At the end of the course, you will be “whoever you were” (your cake) plus “shiny new data skills” (your icing). Figuring out how to sell that in the job market depends on your previous experience, what you do during the training, and how well you do the work of imagining your next life. My fellow students with the clearest career goals got jobs faster (one was even hired before she graduated!). I was probably fair-to-average in that respect, but I knew that I didn’t know. The main reason I chose the bootcamp I did was that I had a good feeling about the career advisor. (Thanks for everything, Marybeth!)

  2. Covid seems to have moved all of the in-person bootcamps online. I personally am glad that I did an in-person class. The ability to practice giving presentations in front of a live group was great practice for both interviewing and for going back to work. The casual face-to-face conversations in the break room with my cohort and the staff were also important parts of the experience for me. When you’re all working in a computer lab together, it’s easy to ask the person next to you for help, while online you might not really have a sense of whether other folks in the room are having the same issue, or if they are ok with being interrupted.

  3. You will not learn all of data science in 12 weeks, but you will learn a lot, and you will learn how to figure out “just in time” learning, which is the only way to keep up with a constantly changing field. Expect to have to continue to study afterwards to prepare for interviews and to continue to grow on the job. But, also, once you’re away from the experience a bit, you might appreciate it more than you do a week after you graduate. At the end of the whirlwind of curriculum and projects, it’s maybe easier to have a feel for how much you still have to learn than to appreciate your accomplishments.