Democratizing Airbnb’s table-documentation practices

And how Dataframe can help. 🐳🔥

5 min read · Dec 15, 2020

First, a question:

What constitutes “good” data science?

If you were to ask your seasoned data scientist friends, you’d likely get a fantastic soundbite or two. Good data science is communication. Good data science is valuable. Good data science is creative. But from an organizational perspective, these aren’t exactly it.

✍️ Good data science is, first and foremost, good documentation.

Why? Because data science work is not just the final product. It’s about the context around it: the products and people using it, the decisions it drives, the insights it provides. It is science, after all, and the scientific method has careful documentation at its center. Imagine if Marie Curie had not carefully journaled her work on radioactivity. No one would have believed her, and she certainly wouldn’t have received a Nobel Prize. The same standards around documentation should hold for data science. If a data scientist does not leave a paper trail explaining exactly what she did and how she did it, her work will incur considerable technical debt or, worse, become unusable.

But the typical approach to data science documentation is not great.

Let’s face it: our best efforts to document data science work usually fall flat. Here’s a typical strategy:

Add meticulous comments in Jupyter.
Push code to a central repo.
Write a detailed Google doc outlining the work and linking resources.

Completely reasonable, right? But a year later, what happens when this data scientist leaves the company? The person taking over the project goes through something like this:

✅ Find the relevant tables feeding production code.
❓ Grep and jump through the code trail in search of what produces these tables. Look for a relevant Google doc, to no avail.
😩 Contact the former employee directly asking for help.

Companies typically try to solve this by being more opinionated about where to store the learnings: Google docs, white papers, knowledge stores. But this often just adds fuel to the fire. These documents aren’t naturally part of any workflow process, so they tend to just drift off into the ether.

🗄️ Good documentation starts with table documentation.

When I was a data scientist at Airbnb, I loved that their internal tools helped shortcut this process by providing simple but essential table documentation. Once I encountered a table I needed to learn about, I was able to quickly get relevant context by jumping into Dataportal, Airbnb’s data discovery platform. I’d immediately know who owned the table, what it was for, and what the relevant columns were.

💡 And that’s when I realized how powerful just a little bit of context in the right place can be.

Simply knowing the owner of a table saved me from playing Slack tag for hours. Imagine if the documentation were even richer, with links to other docs and to the generating Airflow DAG right there as well. All the context needed to use this table would come for free. The workflow for anyone taking over this work would look more like this:

✅ Find the relevant table feeding production code.
✅ Search for this table in your data discovery engine, then follow the documentation.

Days, weeks, even months saved. Not just for you, but for EVERY. SINGLE. PERSON. in your organization.
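
To make "a little bit of context in the right place" concrete, here is a minimal Python sketch of what a table-documentation record could hold. The `TableDoc` class, its field names, and the example table, owner, and URL are all hypothetical illustrations; they are not Dataportal's or Dataframe's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    """Hypothetical documentation record for one warehouse table."""

    name: str
    owner: str
    description: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> meaning
    links: dict[str, str] = field(default_factory=dict)  # label -> URL

    def summary(self) -> str:
        """Render the context a newcomer needs as plain text."""
        lines = [f"{self.name} (owner: {self.owner})", self.description]
        lines += [f"  column {col}: {meaning}" for col, meaning in self.columns.items()]
        lines += [f"  link {label}: {url}" for label, url in self.links.items()]
        return "\n".join(lines)


# Made-up example values, for illustration only.
doc = TableDoc(
    name="analytics.bookings_daily",
    owner="jane@example.com",
    description="One row per listing per day with booking counts.",
    columns={"listing_id": "Listing identifier", "ds": "Date partition"},
    links={"airflow_dag": "https://airflow.example.com/dags/bookings_daily"},
)
print(doc.summary())
```

Even a record this small answers the three questions from the handover story: who owns the table, what it is for, and where the code that generates it lives.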

🙃 So where should I document my tables?

Native documentation within the warehouse just doesn’t cut it. It’s generally standalone and plaintext, meaning you can’t add rich markdown, get notified about changes or integrate user-level metadata with any of your existing tools. On the other hand, modern documentation tools like Notion or Google Docs don’t cut it either — data is living and constantly changing, and gets out of sync with your docs way too fast. Airbnb’s pattern was right: documentation needs to live within a data discovery platform that is plugged into your data warehouse. People need to find and get context about their tables in the same place.
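
One reason docs kept outside the warehouse go stale is that nothing compares them with the live schema. Here is a rough Python sketch of that kind of check, assuming you can already fetch the current column names from your warehouse (for example, from its information schema). The function name and the sample columns are hypothetical, and this is not a description of how any particular platform implements it.

```python
def find_doc_drift(documented: set[str], live: set[str]) -> dict[str, list[str]]:
    """Compare documented column names with the warehouse's current columns."""
    return {
        # Columns that exist in the warehouse but are missing from the docs.
        "undocumented": sorted(live - documented),
        # Columns the docs still describe but the warehouse no longer has.
        "stale": sorted(documented - live),
    }


# Made-up column sets, for illustration only.
documented_columns = {"listing_id", "ds", "n_bookings"}
live_columns = {"listing_id", "ds", "n_bookings_confirmed", "currency"}

drift = find_doc_drift(documented_columns, live_columns)
print(drift)
```

A platform plugged into the warehouse can run a comparison like this on every schema change, which is the kind of sync a standalone doc cannot do on its own.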

This is why we built our own data discovery platform: to democratize and fully develop the best practices we found in Airbnb’s data stack.

🐳 Dataframe is the most user-friendly way to set up a data discovery + data documentation platform.

We built our platform with one thing in mind: make everything easy. It’ll take you 5 minutes to get started, or less than a minute if one of your teammates has already connected Dataframe to your data warehouse. We’ve kept things exceedingly simple for our beta launch, focusing on what we believe is essential to promote good documentation habits: tags, owners, basic metadata like columns and partition info, and a full-fledged Notion-like README sitting alongside each table.

And the best part? These features will remain completely free. We at Dataframe are all former data scientists and data engineers, so we know how painful life can be when it’s hard to find what you need. This is an endlessly frustrating, headache-inducing problem with a stupidly simple solution: document your tables like you would your code. We built our platform to help your organization solve this quickly, so you can dedicate your time to genuinely interesting and impactful work. In other words, you can spend your time unlocking your true potential as a data scientist.

Dataframe launched into private beta earlier this week, and we’re trickling out a few beta invites every few days. Add yourself to our waitlist at dataframe.ai and join our Slack community to get started.