The Dusty Deck: 2021

I have recently watched several data scientists use Jupyter notebooks, and I felt young again!

Notebooks are basically a command-line interface (CLI) on steroids. They make it easy to re-run commands, possibly after changing them, using a modern graphical user interface in a browser. Experimentation is very easy, since you can modify cells and try out various ideas, while keeping the computation state. Unlike textual interfaces, they can show graphical output directly in the same browser window. And you can also add documentation, including graphics, to create a nicely-formatted document that can be used as a live demonstration or explanation.

Like any other CLI, you are free to re-do any action in any order. This freedom comes with the usual cost: you can get confused about the state of your computation. In such cases, you can run the notebook from the start in order to get a consistent state. However, if you are using a notebook in order to investigate the results of a lengthy computation, you will be reluctant to do that, and will have to be more selective about what you re-run.
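To make the state confusion concrete, here is a minimal sketch that imitates a notebook with plain Python. It is not how Jupyter is implemented; the `cells` dictionary, the `run` and `run_all` helpers, and the numbers are all made up for illustration. Each "cell" is a string of code executed against one shared namespace, and re-running only some cells after an edit leaves a stale value behind.

```python
# A toy stand-in for a notebook kernel: each "cell" is a string of code,
# and all cells execute against one shared namespace (hypothetical names).
namespace = {}

cells = {
    1: "rate = 0.10",
    2: "total = 200 * (1 + rate)",
    3: "report = f'total={total:.2f} at rate={rate:.2f}'",
}

def run(cell_id):
    """Execute one cell against the shared state, like re-running a cell."""
    exec(cells[cell_id], namespace)

def run_all():
    """Discard the state and run every cell in order, from the start."""
    namespace.clear()
    for cell_id in sorted(cells):
        run(cell_id)

run_all()
consistent = namespace["report"]

# Edit cell 1, then re-run cells 1 and 3 but forget cell 2.
cells[1] = "rate = 0.25"
run(1)
run(3)
stale = namespace["report"]  # mixes the new rate with the old total

run_all()
fresh = namespace["report"]

print(consistent)  # total=220.00 at rate=0.10
print(stale)       # total=220.00 at rate=0.25
print(fresh)       # total=250.00 at rate=0.25
```

The middle line is the trap: nothing fails, but the report pairs a new rate with a total computed from the old one. Running from the start fixes it, which is exactly what you are reluctant to do when cell 2 takes an hour.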

Occasionally (well, frequently, in my experience) you will want to debug the behavior of your code. Jupyter notebooks do not support a debugger, so you will have to resort to sprinkling print statements in the code, and analyze the output. This is what I used to do when programming in Fortran using punched cards in the 1970s; and removing the print statements was as easy as removing a few cards from the deck!

These days, however, developers are used to having IDEs with interactive debuggers that support stopping at any statement, inspecting the state of the execution stack, and even making modifications before continuing. After enjoying such tools, it is difficult for people like me to revert to debugging with print statements.

IDEs have a lot of other features; notebooks have syntax highlighting, but they don't have:

Context-sensitive completion. Immediately during typing, or on request, the IDE will provide a list of possible completions for the partial text already written. These depend on the syntactic context, as well as on type information. For example, after typing x.f, the suggestions will include all methods and fields whose names start with "f" and belong to the class of x. The determination of the class of x requires type analysis. This isn't easy even for statically typed languages, such as Java, because the text of the program is incomplete and doesn't even compile. For dynamic languages such as Python, this is even more difficult. This difficulty applies not only to automated tools, but also to people, which is why this feature is particularly helpful.
Interactive error indications. The IDE will flag parts of the text as likely to contain errors of various kinds. These include syntax errors; references to undefined variables or methods; type errors (again particularly difficult, and particularly helpful, for dynamic languages); and shadowing of global variables or variables of enclosing contexts.
Intelligent search. This includes "jump to definition" for variables, methods, and types; list all uses of a variable, method, or type (using type analysis to distinguish different elements with the same name); find overridden or overriding methods; and search by element kind with word completion (for example, a search for a class using the partial name DoI will find a class named DomainInformation).
Automatic correction suggestions ("auto-fix"). For many types of errors, the IDE can suggest one or more ways to fix them; the fix the user chooses is then applied automatically.
Syntactic transformations. These are things like adding placeholders for abstract methods that need to be implemented in a subclass; commenting or uncommenting a section of code; surrounding an expression with various kinds of parentheses or quotes; and many others.
Refactoring. These are behavior-preserving transformations that can have global effects, and can be difficult to do manually. Examples are renaming a variable, method, or class; extracting an expression into a variable for reuse; extracting a section of code to a separate function or method; inlining a variable or method; moving global elements between modules; and moving methods up or down the inheritance hierarchy.
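As an illustration of the search-with-word-completion item above, here is a rough sketch of one plausible matching rule for partial names such as DoI. The function `camel_match` and the rule itself are my own assumptions for illustration; real IDEs may match differently. The idea is to split both the pattern and the class name at capital letters, and require each piece of the pattern to be a prefix of a later word in the name.

```python
import re

# One capitalized "word": an uppercase letter followed by lowercase letters or digits.
WORD = re.compile(r"[A-Z][a-z0-9]*")

def camel_match(pattern: str, name: str) -> bool:
    """Return True if the chunks of `pattern` prefix-match, in order,
    a subsequence of the camel-case words of `name` (an assumed rule)."""
    chunks = WORD.findall(pattern)
    words = WORD.findall(name)
    if not chunks:
        return False
    position = 0
    for chunk in chunks:
        # Skip words until one starts with the current chunk.
        while position < len(words) and not words[position].startswith(chunk):
            position += 1
        if position == len(words):
            return False
        position += 1  # the next chunk must match a later word
    return True

# Hypothetical class names to search through.
candidates = ["DomainInformation", "DocumentIndex", "DataImporter", "InfoDomain"]
matches = [name for name in candidates if camel_match("DoI", name)]
print(matches)  # ['DomainInformation', 'DocumentIndex']
```

Note that this only covers the name-matching part. Restricting the search to classes, as opposed to methods or variables with similar names, is where the IDE's analysis of the code comes in.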

All of these capabilities require deep analysis of the code, including type, data-flow, and control-flow analyses. These are made difficult by more permissive language features (especially dynamic types), and are not always completely accurate. Still, they offer a great boost to productivity, and I have come to rely on all of them, to the extent that I try to minimize my typing and let the IDE do as much of the work as possible. I'm very conscious of error or warning indications; I always try the auto-fix capability, and in many cases find an automatic solution. Refactoring is especially useful; in fact, the availability of refactoring in Eclipse many years ago was the ultimate factor in my moving to IDEs from my beloved Emacs editor for code development.
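To give a feel for what such analysis involves, here is a deliberately crude sketch of one check, flagging references to undefined names, using Python's standard `ast` module to inspect code without running it. The function `undefined_names` is hypothetical and ignores scoping and the order of definitions entirely, so it is far weaker than what a real IDE does; the point is only that the error is found before execution.

```python
import ast
import builtins

def undefined_names(source: str) -> list[str]:
    """Return names that are read somewhere in `source` but never
    assigned, imported, defined, or declared as a parameter anywhere.
    Crude on purpose: no scopes, no ordering, no attribute analysis."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    loaded = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                loaded.append(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.alias):
            # "import a.b" binds "a"; "import a as c" binds "c".
            defined.add((node.asname or node.name).split(".")[0])
    return sorted({name for name in loaded if name not in defined})

snippet = """
import math

def area(radius):
    return math.pi * radius ** 2

print(area(2) + offset)
"""

print(undefined_names(snippet))  # ['offset']
```

Even this toy version shows why permissive language features make the job hard: a name can be defined in one branch only, injected at run time, or deleted, and none of that is visible to a check this simple.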

I recently heard a talk by a self-titled "Chief Data Scientist," who only uses Jupyter notebooks for code. When asked whether he misses the capabilities of IDEs, his reply was that anyone who needs those for data science is working at a too-low level. Presumably he limits himself to using existing tools for data exploration, with a minimum of actual programming.

This is a valid position, but my response would be that data scientists who don't need the capabilities I listed above are working at a too-high level, either avoiding serious programming, or depriving themselves of productivity-boosting tools that developers have come to take for granted in the past few decades. I personally suffer when I have to watch them struggling to manually do things that can easily be automated.

While notebooks have many advantages, they are still in essence command-line tools. It is time that advanced IDE features be added to notebooks, to create a really state-of-the-art environment not only for data scientists but for developers as well. I have seen (but not tested) plugins for a number of popular IDEs to support notebooks, but the browser-based Jupyter notebooks still seem to be the tool of choice for data scientists.

Addendum: Thanks to Lev Greenberg for showing me the notebook capabilities of Microsoft's free IDE, Visual Studio Code. They satisfy a lot of the requirements mentioned above, making VSCode a much better environment for code development in notebooks than the web-based Jupyter Notebooks.

Thursday, December 2, 2021
