35 Skills to Ship Data Science Products

Intro

A few years ago I set out to transition from academia to industry. My academic experience was in spatial statistics, machine learning, and environmental science. Data science seemed like my most promising option. I identified 3 key tools and related skills to master to become a data scientist:

  1. Git - Collaboration, organization, experimentation
  2. Linux - Operating system of web servers
  3. Python - General purpose programming

It's coming up on the 3rd anniversary of identifying those skills. Initially, I didn't understand why these three items were important, but something led me towards them instead of pursuing further expertise in the methodologies of quantitative science.

The importance of that triad of tools is in deploying data scientific work in production applications. If data scientific work is not deployed in production applications, I wonder how it could operate at the volume necessary to yield a business return. In addition, when I hear of management's dissatisfaction with data science staff, it often reflects an experience of a lot of time spent on research with little product deployed.

After three years, I only have a minor case of imposter syndrome. But it took more than learning Git, Linux, and Python. It took 32 more by my reckoning!

From what I've observed, many data science professionals who struggle to put their remarkable knowledge and quantitative expertise into production are lacking in these 32 items. I work in a small company and have to be able to do a lot, so some of these may not apply if you work in a larger team.

The list is provided below. It is intended for people who came from quantitative sciences and moved to development, not the reverse. 32 is a lot, so I separated them into 4 contexts:

  • Business
  • Collaboration
  • Development
  • Shipping

Business

  1. Mentors
    • Outside of open source, it's crucial to have mentors that are invested in your success. Thanks Bert & Albert!
  2. Fail
    • A handful of failures across these items provides crucial information to direct your focus. You will also be able to see pitfalls in future work.
  3. Pragmatic imperfection
    • 80% code quality and 80% quantitative analysis sophistication is probably good enough. It was challenging to transition away from an academic mode of thinking to development, and after learning development it continues to be difficult to find the balance between iterating, maintaining code quality, and leveraging the latest and most powerful analytics!
  4. Interview
    • In 3 years I interviewed with 6 companies, from startups to Amazons, and with a couple of other companies for roles with less of a development focus. Keep doing it to measure yourself against the labour demand.
  5. Industry familiarization
    • Even if you aren't in industry as a data science developer yet, it is important to reflect on the craft in the context of a specific industry. Get to know whatever one (e.g. commerce, finance, insurance) interests you most.
  6. User stories/Value stories
    • Be prepared to demonstrate how data scientific work provides value to the business.

Collaboration

  1. Teach development to others
    • Keep in touch with what it is like to be new to a specific field, solidify your knowledge, and gain some confidence!
  2. Meetups
    • Talk with peers about data science & development
  3. Reproducible work
    • A description (i.e. paper) is not sufficient in the software industry; your developer colleagues will expect the work to prove itself correct in usage, not just in design & implementation. Strive to make it possible for someone with reasonable familiarity (i.e. domain knowledge, but not subject matter knowledge) to contribute to your work after a few days of familiarization.
  4. Thinking of others and your code
    • Including your future self!
  5. Naming
    • Initially it may seem both onerous and trivial to think of good names for things. Resist the urge to give things arbitrary names!
  6. Refactoring
    • See common functionality in several places in the code? Make it into a function. See a function that is longer than the screen height? Pull out sections into smaller functions.
  7. Tidy
    • Actively fight code rot & spaghetti code or it will eventually cripple development. It's not like a battle or war; there's no eventual or immediate victor. It's more like dental hygiene. Every day. Twice a day. Brush and floss.
  8. Open source
    • Experience collaborative software development before doing it for work.
  9. Work with challenging code
    • Work with code that hurts the psyche; it will help cultivate consideration for others when working on greenfield code.
  10. GitHub/Bitbucket/GitLab
    • This is simply the standard for collaborating on code; it's also a pleasure to use.
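
The refactoring and naming items above can be sketched in a few lines of Python. This is only an illustration under assumed names: `normalize`, `rainfall_report`, and `temperature_report` are hypothetical and not taken from any real project. The idea is that scaling logic that would otherwise be copied into both reports is pulled into one descriptively named function.

```python
def normalize(values):
    """Scale values to the 0-1 range; shared by both reports below."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def rainfall_report(rainfall_mm):
    """Hypothetical report that reuses the shared scaling logic."""
    return {"rainfall_scaled": normalize(rainfall_mm)}


def temperature_report(temperature_c):
    """A second hypothetical caller; no copy-pasted scaling code needed."""
    return {"temperature_scaled": normalize(temperature_c)}
```

A name like `normalize` tells a colleague (or your future self) what the function does without reading its body, which an arbitrary name like `f2` would not.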

Development

  1. Web development
    • More and more software ends up on the web; get to know how it works.
  2. Keyboard shortcuts
    • Boilerplate and mouse usage are time consuming.
  3. Databases
    • Do work with relational databases. Bonus points for graph and NoSQL databases.
  4. Language mastery
    • Get to know a language really well.
  5. Google
    • You're gonna google a lot. If 'googling it' isn't your first reaction to a problem, you're not there yet.
  6. Unit tests
    • A follow up to reproducible work. TDD is great. At least provide unit tests using a suitable testing library for the language you work with.
  7. Basic Object Oriented
    • You will run into this. Know the difference between a method and a function. Knowing inheritance helps.
  8. Basic Functional programming
    • You will run into this. Know about side effects, immutable data, passing functions.
  9. Basic software design & architecture
    • Requirements gathering, identifying key components, factoring common functionality, identifying key data structures, identifying users and integrations. It may take just a day or many weeks depending on the complexity of the project. Diligence here sets a project on tracks to success.
  10. Packaging code
    • Learn to bundle applications according to the standard for the language you work with & the use case (e.g. binaries, web service).
  11. Leverage libraries
    • Do not duplicate work. If something you need to do is implemented elsewhere, and the implementation is reliable, use the other implementation. If you have to bend your mind and struggle to figure out how to make it work for your use case, it's probably because you are not looking at your problem in an abstract way, not because the general solution is unsuitable.
  12. Performance
    • Learn to profile code using language standard tools and implement suitable optimizations.
    • Multi-threading and multi-processing are great performance wins once the easy single-threaded, single-core optimizations are made.
  13. Debugging
    • I wrote code for 4 years before learning what an interactive debugger was and how to use it. Don't delay debugging.
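
Several of the items above (unit tests, method versus function, side effects, immutable data, passing functions) fit in one small Python sketch. All names here (`add_reading`, `Sensor`, `apply_to_all`, `TestReadings`) are made up for illustration, and the standard library's `unittest` is just one suitable testing library.

```python
import unittest


def add_reading(readings, value):
    """Function without side effects: returns a new tuple, input untouched."""
    return readings + (value,)


class Sensor:
    """Methods are functions attached to an object; they can change self."""

    def __init__(self):
        self.readings = []

    def record(self, value):
        # Side effect: this mutates the sensor's own state.
        self.readings.append(value)

    def mean(self):
        return sum(self.readings) / len(self.readings)


def apply_to_all(func, values):
    """Passing functions: func is handed in like any other argument."""
    return [func(v) for v in values]


class TestReadings(unittest.TestCase):
    def test_add_reading_does_not_mutate(self):
        original = (1.0, 2.0)
        updated = add_reading(original, 3.0)
        self.assertEqual(original, (1.0, 2.0))
        self.assertEqual(updated, (1.0, 2.0, 3.0))

    def test_sensor_mean(self):
        sensor = Sensor()
        sensor.record(2.0)
        sensor.record(4.0)
        self.assertEqual(sensor.mean(), 3.0)
```

Saved as a module, the tests can be run with `python -m unittest`. The function that avoids side effects is the easier one to test, because its result depends only on its arguments.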

Shipping

  1. Continuous integration/continuous deployment
    • Software is always changing; make sure users can get the latest changes every time code is pushed to remote version control.
  2. Docker
    • It's hard to summarize the value of Docker until you use it. Once you go Docker you'll never go back.
  3. Cloud infrastructure
    • Choose a cloud provider and learn to deploy there; AWS, Azure, and Google are good options.

Conclusion

The initial 3, Python, Linux, and Git, were instrumental in setting me on a good path. If you're on the road to data science software development, consider starting there.

If you're well on the way, or a bona fide success, can you check the 32 other items? Are there any you would add?
