Things I Learned From the SciPy 2019 Lightning Talks
This post summarizes the interesting aspects of the Day One of the SciPy 2019 lightning talks, a flash round of a dozen ~3 minute talks covering a wide variety of topics.
SciPy 2019 wrapped up this past weekend, and all of its talks are already online.
While I plan to get through as many of these talks as possible, I decided to whet my appetite with the lightning talks.
As such, here are a few things I learned from day one of the SciPy 2019 lightning talks.
Hi, I'm your technical interviewer: Advice for breaking into the industry, Paul Anzel
This lightning talk presented advice on transitioning from academia to industry.
Don't make these mistakes:
- not networking - you need to network,; ~80% jobs aren't posted, need insider info
- not using referrals - can help skip first few interview steps
- not using exact job description language - automated systems aren't very smart (sometimes technical recruiters aren't either), so use exact description wording (job asks for machine learning is description, make sure the words are explicitly in your resume)
- not covering your basis - specifically mentions SQL, version control, software carpentry for minimum viable knowledge
- worrying about every job line item - aim for a hit rate of 66-75% with "required experience"
- not having a portfolio or blog - gives interviewer something to discuss with you, shows you can code, communicate; make some nice looking presentation web pages and readme files
US Research Software Sustainability Insittiute (URSSI) Pilot "Summer" School, Kyle Niemeyer
The "summer school" mission: "To improve the quality, usefulness, and sustainability of research software by improving practices and increasing diversity of practitioners."
Broader goal: Most people aren't trained to specifically write research software, and so a large gap exists between learning the basics of coding and being expected to write research software, a gap which can be filled with proper training.
The project is in a planning phase at this point, with a pilot project currently running. The hope is that funding will extend the school.
Topics covered include:
- software design and modularity
- collaborative software development via GitHub
- software testing in Python
- peer code review
- packaging and distributing Python software
- documentation
- licensing, open sharing, and software citation
The "summer" school takes place Dec 17-19, 2019, at the eScience Institute at the University of Washington in Seattle. It is aimed at early career researchers, and the pilot program will be free, as it is supported by an NSF grant.
Using numba to go from byutes to DataFrame, fast!, Tim Swast
Problem: how to go from Avro bytes (representing BigQuery tables) to Pandas DataFrame quickly; can be done with fastavro in Python, but it's too slow
Solution: manually write code to parse your specific Avro schema, generate code based on Avro schema and then JIT that code???
We can JIT in Python with numba! numba is pretty easy to use: no build steps (compare with Cython), just Python code.
Original problem: parse 150 MB of Avro bytes to Pandas DF with fastavro - took 75 seconds
New solution: 5 seconds
This new solution uses Arrow as an intermediate format, as it better supports NULL values and there is a clear path from bytes -> Arrow arrays & table -> Pandas DF.
Fast things make Tim happy.
napari - Multidimensional image viewer for Python, Nicholas Sofroniew (I believe)
Short overview of image viewer which:
- integrates with Python scientific stack
- supports very large datasets
- is designed for scientific use cases (though it has features you might find in, say, Adobe Illustrator)
Leveraging open source software to recreate the coastline, Kim Pevey
Problem: Aerial image modeling to digitize coastline traditionally took a long time (weeks - months)
Solution: Open source tools changes this to 3-5 minutes!
- The project goes out to grab web data and delineates images
- Allows for easily modifiable image vertices, addition of notes
Ibis: Python data analysis productivity framework, Ivan Ogasawara
- Ibis helps those who don't know SQL to create expressions and grab data from databases
- has a Pandas API
- returns a Pandas DataFrame
- adding geospatial data support to help with plotting of geospatial data
- supports trigonometric operations and statistical operations
Why use Ibis?
- if you get data from a SQL database
- if create statements mentally which are hard to maintain
- want to work with big data
Dask-Gateway, Jim Crist
Dask is like Spark, but "friendly and flexible," and in Python.
Where to deploy Dask? Kubernetes, YARN, HPC Clusters
Dask-Gateway is to accommodate common feature requests from some users which seem above and beyond "just Dask," such as:
- central management
- restrict user permissions, resources
- proxy out network connections
- support long running clusters
- encryption and authentication
"Like JupyterHub, but for Dask"
Not well documented :) What there is can be found here.
Frankenstein's model, or understanding your monster, Dillon Niederhut
- Linear models have shortcomings, need complex models for real world problems
- Complex models are black boxes
- Complex models are not easily legible
The factorials in the above equation generally make this non-tractable for any non-trivial problem. SHAP takes a stab at doing this, and for some models it has been shown that the time required is quadratic to the depth of the trees predicting on.
How can we make a non-linear model more easily interpretible? Train a linear model to predict the behavior of the non-linear one; this linear model would then be much more interpretible.
The tornado image above can show correlations of features with positive and negative outcomes (in this case, burrito quality), which are learned from the complex non-linear model (in this case, random forest). Can generate actionable insights as an outcome, yet still harness the power of a more complex non-linear model to do the heavy lifting.
LFortran - A modern ineractive LLVM-based Fortran compiler
Do you use Python for prototyping and algorithm development, but then have to translate Python to a production language manually in order to see the actual benefits? If your production is Fortran, why not directly compile and execute Fortran in Jupyter notebooks instead!
The project is in early stages, with a roadmap, envisions a scenario where Fortran can be used in Jupyter to prototype and then have the notebook output modern, usable Fortran for a production environment, among other uses.
Where to put your custom code?, Chris Barker
This talk is a suggestion for how to manage your personal library of python functions you might use for scripting, data analysis, etc.
- don't copy and paste code into new projects
- don't put all your functions into a file and add it to the python path
- make a package!
Python packaging isn't just for distribution, but can be used for this purpose as well. Just don't take the step of putting your personal packages on PyPI.
How? It's easy; use the proper tree directory:
my_code my_code __init__.py some_code.py setup.py
Though Python package documentation says the setup.py requires requirements, metadata, and more for distribution, but this can be overcome with a few simple lines for your own packages:
setup(name='my_code', packages=['my_code'], )
That's it! To make it simper, Chris has created a script which does all of this packaging for you.
Let's Do This Together: Diversity and Inclusion, and Mental Wellness, Steven Silvester
This talk is a brief discussion of the benefits of diversity and inclusion, and how it relates to open source development, coding and more.
This is followed up with a personal story of mental wellness and how to get a handle on it.
The mouse aging cell atlas aka Cell biology meets Python, Angela Oliveira Pisco
Using scientific python, Angela and team have been able to manage .5 million cells across 20 tissues across 6 different mouse ages. Cells were classified, an annotated collection of cells was created, and this was made public for anyone to use.
As a result, they were able to answer numerous cellular biology questions after classification and cluster analysis. An example of this is "do the fraction of cells responsible for a particular poor biological behavior change over time?"
Related:
- Notes on Feature Preprocessing: The What, the Why, and the How
- Python Data Science for Beginners
- Top 20 Python Libraries for Data Science in 2018