Things I Learned From the SciPy 2019 Lightning Talks

This post summarizes the interesting aspects of the Day One of the SciPy 2019 lightning talks, a flash round of a dozen ~3 minute talks covering a wide variety of topics.



SciPy 2019 wrapped up this past weekend, and all of its talks are already online.

While I plan to get through as many of these talks as possible, I decided to whet my appetite with the lightning talks.

As such, here are a few things I learned from day one of the SciPy 2019 lightning talks.

 
Hi, I'm your technical interviewer: Advice for breaking into the industry, Paul Anzel

This lightning talk presented advice on transitioning from academia to industry.

I'm your technical interviewer

Don't make these mistakes:

  1. not networking - you need to network,; ~80% jobs aren't posted, need insider info
  2. not using referrals - can help skip first few interview steps
  3. not using exact job description language - automated systems aren't very smart (sometimes technical recruiters aren't either), so use exact description wording (job asks for machine learning is description, make sure the words are explicitly in your resume)
  4. not covering your basis - specifically mentions SQL, version control, software carpentry for minimum viable knowledge
  5. worrying about every job line item - aim for a hit rate of 66-75% with "required experience"
  6. not having a portfolio or blog - gives interviewer something to discuss with you, shows you can code, communicate; make some nice looking presentation web pages and readme files

 
US Research Software Sustainability Insittiute (URSSI) Pilot "Summer" School, Kyle Niemeyer

The "summer school" mission: "To improve the quality, usefulness, and sustainability of research software by improving practices and increasing diversity of practitioners."

Broader goal: Most people aren't trained to specifically write research software, and so a large gap exists between learning the basics of coding and being expected to write research software, a gap which can be filled with proper training.

The project is in a planning phase at this point, with a pilot project currently running. The hope is that funding will extend the school.

Image

Topics covered include:

  • software design and modularity
  • collaborative software development via GitHub
  • software testing in Python
  • peer code review
  • packaging and distributing Python software
  • documentation
  • licensing, open sharing, and software citation

The "summer" school takes place Dec 17-19, 2019, at the eScience Institute at the University of Washington in Seattle. It is aimed at early career researchers, and the pilot program will be free, as it is supported by an NSF grant.

 
Using numba to go from byutes to DataFrame, fast!, Tim Swast

Problem: how to go from Avro bytes (representing BigQuery tables) to Pandas DataFrame quickly; can be done with fastavro in Python, but it's too slow

Solution: manually write code to parse your specific Avro schema, generate code based on Avro schema and then JIT that code???

We can JIT in Python with numba! numba is pretty easy to use: no build steps (compare with Cython), just Python code.

Original problem: parse 150 MB of Avro bytes to Pandas DF with fastavro - took 75 seconds

New solution: 5 seconds

This new solution uses Arrow as an intermediate format, as it better supports NULL values and there is a clear path from bytes -> Arrow arrays & table -> Pandas DF.

Fast things make Tim happy.

 
napari - Multidimensional image viewer for Python, Nicholas Sofroniew (I believe)

Short overview of image viewer which:

  • integrates with Python scientific stack
  • supports very large datasets
  • is designed for scientific use cases (though it has features you might find in, say, Adobe Illustrator)

Image

 
Leveraging open source software to recreate the coastline, Kim Pevey

Problem: Aerial image modeling to digitize coastline traditionally took a long time (weeks - months)

Solution: Open source tools changes this to 3-5 minutes!

  • The project goes out to grab web data and delineates images
  • Allows for easily modifiable image vertices, addition of notes

 
Ibis: Python data analysis productivity framework, Ivan Ogasawara

  • Ibis helps those who don't know SQL to create expressions and grab data from databases
  • has a Pandas API
  • returns a Pandas DataFrame
  • adding geospatial data support to help with plotting of geospatial data
  • supports trigonometric operations and statistical operations

Why use Ibis?

  • if you get data from a SQL database
  • if create statements mentally which are hard to maintain
  • want to work with big data

 
Dask-Gateway, Jim Crist

Dask is like Spark, but "friendly and flexible," and in Python.

Where to deploy Dask? Kubernetes, YARN, HPC Clusters

Dask-Gateway is to accommodate common feature requests from some users which seem above and beyond "just Dask," such as:

  • central management
  • restrict user permissions, resources
  • proxy out network connections
  • support long running clusters
  • encryption and authentication

Image

"Like JupyterHub, but for Dask"

Not well documented :) What there is can be found here.

 
Frankenstein's model, or understanding your monster, Dillon Niederhut

  • Linear models have shortcomings, need complex models for real world problems
  • Complex models are black boxes
  • Complex models are not easily legible

Image

The factorials in the above equation generally make this non-tractable for any non-trivial problem. SHAP takes a stab at doing this, and for some models it has been shown that the time required is quadratic to the depth of the trees predicting on.

How can we make a non-linear model more easily interpretible? Train a linear model to predict the behavior of the non-linear one; this linear model would then be much more interpretible.

Image

The tornado image above can show correlations of features with positive and negative outcomes (in this case, burrito quality), which are learned from the complex non-linear model (in this case, random forest). Can generate actionable insights as an outcome, yet still harness the power of a more complex non-linear model to do the heavy lifting.

 
LFortran - A modern ineractive LLVM-based Fortran compiler

Do you use Python for prototyping and algorithm development, but then have to translate Python to a production language manually in order to see the actual benefits? If your production is Fortran, why not directly compile and execute Fortran in Jupyter notebooks instead!

Image

The project is in early stages, with a roadmap, envisions a scenario where Fortran can be used in Jupyter to prototype and then have the notebook output modern, usable Fortran for a production environment, among other uses.

 
Where to put your custom code?, Chris Barker

This talk is a suggestion for how to manage your personal library of python functions you might use for scripting, data analysis, etc.

  • don't copy and paste code into new projects
  • don't put all your functions into a file and add it to the python path
  • make a package!

Python packaging isn't just for distribution, but can be used for this purpose as well. Just don't take the step of putting your personal packages on PyPI.

How? It's easy; use the proper tree directory:

my_code
    my_code
        __init__.py
        some_code.py
    setup.py


Though Python package documentation says the setup.py requires requirements, metadata, and more for distribution, but this can be overcome with a few simple lines for your own packages:

setup(name='my_code',
      packages=['my_code'],
      )


That's it! To make it simper, Chris has created a script which does all of this packaging for you.

 
Let's Do This Together: Diversity and Inclusion, and Mental Wellness, Steven Silvester

This talk is a brief discussion of the benefits of diversity and inclusion, and how it relates to open source development, coding and more.

This is followed up with a personal story of mental wellness and how to get a handle on it.

 
The mouse aging cell atlas aka Cell biology meets Python, Angela Oliveira Pisco

Using scientific python, Angela and team have been able to manage .5 million cells across 20 tissues across 6 different mouse ages. Cells were classified, an annotated collection of cells was created, and this was made public for anyone to use.

Image

As a result, they were able to answer numerous cellular biology questions after classification and cluster analysis. An example of this is "do the fraction of cells responsible for a particular poor biological behavior change over time?"

 
Related: