Most people can lift one kilogram, but would struggle to lift one hundred, and could not lift a thousand without planning and support. Similarly, many researchers who can write a few lines of Python, R, or MATLAB to create a plot struggle to create programs that are a few hundred lines long, and don’t know where to start designing an application made up of dozens or hundreds of files or packages. The core challenge is that programming in the large is qualitatively different from programming in the small. As the number of pieces in a program grows, the number of possible interactions between those pieces grows much more quickly because N components can be paired in roughly N² ways. Programmers who don’t manage this complexity invariably produce software that behaves in unexpected (usually unfortunate) ways and cannot be modified without heroic effort. This paper presents a dozen tips that can help data scientists design large programs. These tips are taken from published sources like [1,2], the author’s personal experience, and 35 years of discussion with the creators of widely used libraries and applications. They aren’t always appropriate; short scripts written for exploratory data analysis, for example, don’t need to worry about unit testing. However, if you find yourself sketching data structures on the whiteboard, thinking about how different configuration options interact, or wondering how you’re going to support old releases while working on the new one, these tips may help.
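
To make the pairing claim concrete, here is a minimal Python sketch (illustrative only, not part of the paper) that counts the possible pairwise interactions, C(N, 2) = N*(N-1)/2, for a few component counts; the count grows roughly as N²:

    from math import comb  # exact binomial coefficient (Python 3.8+)

    # Number of possible pairwise interactions among N components.
    for n in (10, 100, 1000):
        print(f"{n:>5} components -> {comb(n, 2):>7,} possible pairings")

Running this prints 45, 4,950, and 499,500 pairings: multiplying the number of components by ten multiplies the number of possible interactions by roughly one hundred, which is why design effort has to grow faster than program size.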