Data janitors wanted.

Enough of this “data scientist” nonsense.

Nick Huber
6 min read · Jun 5, 2021

I recently came across an epic analytics post from 2016 out of Stitch Fix on why “Engineers Shouldn’t Write ETL.” Beyond the titular claim, it dissects the common but problematic dichotomy in many data orgs where “data scientists” are the “ideas people” and “data engineers” are the “implementers.” It’s well worth reading in its entirety, but my favorite part is that it not only lucidly explains why this division of work is untenable, but also offers an explanation for why this characterization of the data scientist role came about. Simply:

[T]he role sounds really nice, and it’s easy to recruit for.

Since obtaining the “sexiest job of the 21st century” in 2014, which may have been pitched to me in these terms (I don’t remember anymore tbh), I’ve interviewed ~100 data candidates, reviewed 1000s of profiles, and worked with scores of data scientists, engineers, and analysts on dozens of projects and initiatives in organizations in the US and across Southeast Asia. As part of this journey, I’ve often found myself having to clarify expectations about what the role of a “data scientist” really is, both internally to candidates and externally to business users/clients.

On one hand, I’ve had a final-round interview with a candidate finishing her master’s in data analytics who could fit a model but couldn’t define accuracy, even after 2–3 chances and a few hints. On the other, I’ve had executives who read an article on operational AI and expect it to be up and running within 3 months in orgs that don’t even have a modern data warehouse yet. In both cases, while I’m grateful for the continued investment in my field, I believe the core issue is a misunderstanding of the real responsibilities of a “data scientist.” Though it’s still a work in progress, I’m hoping this post will clarify how I see the role.

In my head, the real role of a data scientist (or data janitor, a name I find more accurate and increasingly prefer, though at first only jokingly) can be decomposed into the following main responsibilities:

  • Translate business questions into analytics code and data sources. Very rare is the organization that doesn’t have any kind of analytics and reporting set up, and so too the businessperson who doesn’t use some kind of report to make decisions. More commonly, the granularity, frequency, breadth or depth of these reports and analyses leave too many questions unanswered. Perhaps we’ve measured adoption in some simple way for 2 years, but it’s no longer offering insight into user behavior. Or Sales’ definition of sales is wholly out of whack with how Marketing thinks about it. Or the monthly utilization report is often delayed, and its latency leaves no time to change pricing or availability. Or we can’t split our most important business metrics (revenue, new users) by some important dimensions (market, channel). Sometimes, a lucky data janitor will get these needs directly from a decision-maker; other times, they need to be teased out. In either case, they’ll never be fully formed enough to start collecting data and writing code, so the data janitor must engage multiple groups consultatively to both refine the business need and work out where the best available data on the subject lives.
  • Deeply understand/grok data at a source system level. Once a data janitor has defined the business need and mapped out the best available data sources, she must deeply understand her data at a source system level. While the data scientist quickly throws his hands up when data availability or data quality issues arise, the diligent data janitor instead triangulates the business need via alternative, more readily available data sources and seeks to understand why the source data is dirty in the first place and what can be done about it. I once had to traverse, follow up with, cajole, and charm an IT organization for 2 months to get data on a channel that accounted for 30% of a business’s sales. That 30% number btw we only realized after we got the data ofc, and no one had been tuning the channel in the years since its development, because no reports on its performance had ever been built! I’ve gone deep into cold storage Hive tables on JavaScript events to understand programmatic competitor behavior on a platform, whose only traces were visible in its API calls to our servers. I’ve worked with large, publicly traded businesses that have misreported aspects of top-level business metrics to their Boards of Directors, due to faulty and incomplete data pipelines and inadequate reports. But all of this has only been possible because I was willing to dive deep to grok data at a source system level, whereas others were satisfied with surface-level explanations. Seemingly, no one wants to talk about this kind of data work, but it is the true calling of a great data janitor.
  • Define metrics that accurately capture business outcomes. A data scientist will impractically define business metrics from data sources that don’t exist yet. The humble data janitor will understand that decisions need to be made in actual, lived reality and tailor the best metrics that can be made from the currently available data, while adding to her backlog to implement more and better logging. These metrics may take a few iterations to achieve adoption by the business/product teams, but eventually they become embedded in those teams’ language and ways of thinking. They can then be regularly monitored and split further in dashboards/business intelligence tools, which the noble data janitor proactively builds before her business teams even ask. Rather than allowing her team’s mental workspace to become disorderly, a good data janitor regularly purges old metrics that are no longer relevant.
  • Propose and implement initiatives to drive key business metrics. While the data janitor knows that her primary job is to empower others with the knowledge to make better decisions, she still understands she has a direct hand to play in this effort as well. When there is a business problem that could benefit from predictive modeling, a data janitor will judiciously choose one that allows for quick feasibility assessment, future iteration, and eventually seamless in-product/-experience integration. When the teams she supports start wondering deeply about the causal effects of their recent product changes on the metrics she helped define, she calmly evangelizes good experimental design and suggests a few specific experiments for the team to run in the next month. As their journey towards experimental design progresses, she eventually becomes the fountainhead for the most critical, strategic experiments that the team runs to drive improvements in the team’s core metrics.
  • Systematize and maintain this work. The data scientist leaves in his wake undocumented PowerPoints, uninterpretable word clouds, Jupyter notebooks that can’t be run top to bottom, and experimental models which he proudly claims have state-of-the-art results but no business impact or relevance. In contrast, every 6–12 months, the data janitor reviews her main areas of responsibility and proactively searches for and adopts tools that increase her overall effectiveness. If she spends too much time on maintaining data transformation code, she starts leveraging dbt models. If her queries start to time out, she nudges the org towards a modern data warehouse like BigQuery, Snowflake or Redshift. She is more passionate about how the full-stack engineers log their events in Segment/RudderStack/mParticle than they are, and helps organize consistent patterns for the team to do so. She proactively looks for and builds solutions on data quality monitoring, model development/deployment, and modern business intelligence tools. Fundamentally, she enjoys the core journey of exploration required in her data work, but understands that if everything goes as she wishes, and the business adopts her recommendations, the duty to build, maintain and improve systems for these is sacred, business-critical, and non-optional. She judiciously experiments and systematizes like a great full-stack engineer, always with characteristically deep intuition for when to do which.
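To make a couple of these responsibilities a bit more concrete, here is a minimal sketch in Python of splitting a business metric (revenue) by dimensions (market, channel) and running one crude data-quality check. Everything in it is an illustrative assumption rather than anything from a real system: the `ORDERS` records, the field names, and the helper functions `revenue_by` and `missing_share` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order records; the field names and values are made up for illustration.
ORDERS = [
    {"market": "PH", "channel": "web", "revenue": 120.0},
    {"market": "PH", "channel": "app", "revenue": 80.0},
    {"market": "SG", "channel": "web", "revenue": 200.0},
    {"market": "SG", "channel": None, "revenue": 50.0},  # channel missing at the source
]


def revenue_by(rows, *dimensions):
    """Sum revenue for each combination of the given dimensions.

    Missing dimension values are grouped under "unknown" instead of being
    dropped, so the gaps stay visible in the report.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row.get(d) or "unknown" for d in dimensions)
        totals[key] += row["revenue"]
    return dict(totals)


def missing_share(rows, dimension):
    """Share of total revenue whose value for `dimension` is missing."""
    total = sum(r["revenue"] for r in rows)
    missing = sum(r["revenue"] for r in rows if r.get(dimension) is None)
    return missing / total if total else 0.0


if __name__ == "__main__":
    print(revenue_by(ORDERS, "market"))
    print(revenue_by(ORDERS, "market", "channel"))
    print(f"revenue with no channel: {missing_share(ORDERS, 'channel'):.1%}")
```

In practice this logic would more likely live in SQL or a dbt model than in a script, but the shape is the same: pick the metric, pick the dimensions, and keep the dirty rows visible rather than silently filtering them out.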

Fundamentally, the demand for folks with the right attitude, motivations, and this set of data skills is only growing. The cost of storage and compute continues to go down, increasingly making the rate-limiting step in most cases the ability to hire and cultivate good data analysts, engineers, and scientists (or data janitors, if you will allow me). Whatever you want to call them, good analytics work yields quantifiable business impact, in the form of cost savings, incremental revenue, growth, or engagement, and always brings up more future analytics work and questions than it can address by itself.

But, in the spirit of directness and to poke the bear a bit, I say, let’s rid ourselves of this highfalutin “data scientist” nonsense and just call it how it is: data janitors wanted.

Nick Huber

Hi, I’m Nick! 👋, a self-taught data scientist 📈, programmer 🖥️, and part-time investor 💵. VP at Thinking Machines, prev data science at Airbnb, Quora, FB.