Fabric Lakehouse Governance & Data Lineage

22 minutes · Podcast

M365 Show brings you expert insights, news, and strategies across Power Platform, Azure, Security, Data, and Collaboration in the Microsoft ecosystem.

Hosted by MirkoPeters (Stuttgart)

Description

3 months ago

If you've ever wondered why your data suddenly disappears from a report, or who exactly changed the source file feeding your monthly dashboard, you're not alone. Most teams are flying blind when it comes to seeing the full journey of their data.

Today, we're going to trace that journey inside Microsoft Fabric — from ingestion, through transformation, into analytics — and uncover how lineage, permissions, and the catalog work together to keep you in control. By the end, you'll see every hop your data makes, and exactly who can touch it.


Seeing the Invisible: The Path Data Actually Takes


Most people picture data traveling like a straight road: it leaves the source, passes through a few hands, and ends up neatly in a report. In reality, it’s closer to navigating an old building that’s been renovated a dozen times. You’ve got hallways that suddenly lead to locked doors, side passages you didn’t even know existed, and shortcuts that bypass major rooms entirely. That’s the challenge inside any modern analytics platform—your data’s path isn’t just a single pipeline, it’s a web of steps, connections, and transformations.

Microsoft Fabric’s Lakehouse model gives the impression of a single, unified home for your data. And it is unified—but under the hood, it’s a mix of specialized services working together. There’s a storage layer, an analytics layer, orchestration tools, and processing engines. They talk to each other constantly, passing data back and forth. Without the right tools to record those interactions, what you actually have is a maze with no map. You might know how records entered the system and which report they eventually landed in, but the middle remains a black box.

When that black box gets in the way, it’s usually during troubleshooting. Maybe a number is wrong in last month’s sales report. You check the report logic; it looks fine. The dataset it’s built on seems fine too. But somewhere upstream, a transformation changed the values, and no one documented it. That invisible hop—where the number stopped being accurate—becomes the needle in the haystack. And the longer a platform has been in use, the more invisible hops it tends to collect.

This is where Fabric’s approach to lineage takes the maze and lays down a breadcrumb trail. Take a simple example: data comes in through Data Factory. The moment the pipeline runs, lineage capture starts—without you having to configure anything special. Fabric logs not just the target table in the Lakehouse but also every source dataset, transformation step, and subsequent table or view created from it. It doesn’t matter if those downstream objects live in the same workspace or feed into another Fabric service—those links get recorded automatically in the background. In practice, that means if you open the lineage view for a dataset, you’re not just seeing what it feeds—you’re seeing everything feeding it, all the way back to the ingestion point. It’s like tracking a shipment and seeing its path from the supplier’s warehouse, through every distribution center, truck, and sorting facility, instead of just getting a “delivered” notification. You get visibility over the entire chain, not just the start and finish.
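If you want to pull that same lineage out of the platform programmatically, the admin metadata scanning API can return item-to-item links alongside the inventory. Here’s a minimal Python sketch, assuming a pre-acquired access token with tenant-level read permissions and a placeholder workspace ID; treat the exact request shape and result fields as something to verify against your own tenant.

    import time
    import requests

    BASE = "https://api.powerbi.com/v1.0/myorg/admin/workspaces"
    headers = {"Authorization": "Bearer <token>"}  # placeholder token

    # 1. Request a scan of one workspace, asking for lineage between items.
    scan = requests.post(
        f"{BASE}/getInfo",
        params={"lineage": "true", "datasourceDetails": "true"},
        headers=headers,
        json={"workspaces": ["<workspace-id>"]},  # placeholder workspace ID
    ).json()

    # 2. Poll until the scan completes, then download the result.
    while requests.get(f"{BASE}/scanStatus/{scan['id']}", headers=headers).json()["status"] != "Succeeded":
        time.sleep(5)

    result = requests.get(f"{BASE}/scanResult/{scan['id']}", headers=headers).json()

    # 3. Each dataset entry lists its upstream items, so the hops between
    # items can be reconstructed without any hand-maintained diagram.
    for ws in result.get("workspaces", []):
        for ds in ws.get("datasets", []):
            print(ds["name"], "<-", ds.get("upstreamDataflows", []))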
Now, there’s a big difference between choosing to document lineage and having the system do it for you. With user-driven documentation, it’s only as current as the last time someone updated it—assuming they remembered to update it at all. With Fabric, this happens as a side effect of using the platform. The metadata is generated as you create, move, and transform data, so it’s both current and accurate. This reduces the human factor almost entirely, which is the only way lineage maps ever stay trustworthy in a large, active environment.

It’s worth noting that what Fabric stores isn’t just a static diagram. That automatically generated metadata becomes the basis for other controls—controls that don’t just visualize the flow but actually enforce governance. It’s the foundation for connecting technical lineage to permissions, audit trails, and compliance cataloging. When you hear “metadata,” it can sound like passive information, but here it’s the scaffolding that other rules are built on.

And once you have that scaffolding in place, permissions stop being static access lists. They can reflect the actual relationships between datasets, reports, and workspaces. Which means you’re not granting access in isolation anymore—you’re granting it with the full context of where that data came from and where it’s going. That’s where lineage stops being just an operational tool for troubleshooting and becomes a strategic tool for governance. Because once you can see the full path every dataset takes, you can make sure control over it travels just as consistently. And that’s exactly where permission inheritance steps in.


One Permission, Everywhere It Matters


Imagine giving someone permission to open a finished, polished report — only to find out they can now see the raw, unfiltered data behind it. It’s more common than you’d think. The intent is harmless: you want them to view the insights. But if the permissions aren’t aligned across every stage, you’ve just handed over access to things you never meant to share.

In the Lakehouse, Microsoft Fabric tries to solve this with permission inheritance. Instead of treating ingestion, storage, and analytics as isolated islands, it treats them like rooms inside the same building. If someone has a key to enter one room, and that room directly feeds into the next, they don’t need a separate key — the access decision flows naturally from the first.

The model works by using your workspaces as the control point. Everything in that workspace — whether it’s a table in the Lakehouse, a semantic model in Power BI, or a pipeline in Data Factory — draws from the same set of permissions unless you override them on purpose. In a more siloed environment, permissions are often mapped at each stage by different tools or even different teams: one team manages database roles, another manages storage ACLs, another handles report permissions. Over time, those separate lists drift apart. You lock something down in one place but forget to match it in another, or you remove a user from one system but they still have credentials cached in another. That’s how security drift creeps in — what was once a consistent policy slowly turns into a patchwork.

Let’s make this concrete. Picture a Lakehouse table holding sales transactions. It’s secured so that only the finance team can view it. Now imagine you build a Power BI dataset that pulls directly from that table, and then a dashboard on top of that dataset. In a traditional setup, you’d need to manually ensure that the Power BI dataset carries the same restrictions as the Lakehouse table. Miss something, and a user with only dashboard access could still query the source table and see sensitive details. In Fabric, if both the Lakehouse and the Power BI dataset live under the same workspace structure, the permissions cascade automatically. That finance-only table is still finance-only when it’s viewed through Power BI. You don’t touch a single extra setting to make that happen. Fabric already knows that the dataset’s upstream source is a restricted table, so it doesn’t hand out access to the dataset without verifying the upstream rules.

The mechanics are straightforward but powerful. Because workspaces are the organizing unit, and everything inside follows the same security model, there’s no need to replicate ACLs or keep separate identity lists in sync. If you remove someone from the workspace, they’re removed everywhere that workspace’s assets appear. The administrative load drops sharply, but more importantly, the chances of accidental access go down with it.
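Because the workspace is the unit of control, the grant itself is a single operation. As a rough sketch of what that looks like through the Fabric REST API (the workspace ID, user object ID, and token below are placeholders, and the request shape is worth double-checking against the current API reference):

    import requests

    WORKSPACE = "<workspace-id>"  # placeholder
    URL = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE}/roleAssignments"
    headers = {"Authorization": "Bearer <token>"}  # placeholder token

    # One call grants a principal the Viewer role on the workspace, and with
    # it read access to every item the workspace contains - Lakehouse tables,
    # semantic models, pipelines - instead of three separate ACL edits.
    payload = {
        "principal": {"id": "<user-object-id>", "type": "User"},  # placeholder
        "role": "Viewer",
    }
    requests.post(URL, headers=headers, json=payload).raise_for_status()

    # Deleting the same role assignment later revokes access everywhere that
    # workspace's items appear, in one step.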
This is where the contrast with old methods becomes clear. In a classic warehouse + BI tool setup, you might have a database role in SQL Server, a folder permission in a file share, and a dataset permission in your reporting tool — all for the same logical data flow. Managing those in parallel means triple the work and triple the opportunity to miss a step. Even with automation scripts, that’s still extra moving parts to maintain.

The “one permission, many surfaces” approach means that a change at the source isn’t just reflected — it’s enforced everywhere downstream. If the Lakehouse table is locked, no derived dataset or visual bypasses that lock. For governance, that’s not a nice-to-have — it’s the control that stops data from leaking when reports are shared more widely than planned. It aligns your security model with your actual data flow, instead of leaving them as two separate conversations.

When you combine this with the lineage mapping we just talked about, those permissions aren’t operating in a void. They’re linked, visually and technically, to the exact paths your data takes. That makes it possible to see not just who has access, but how that access might propagate through connected datasets, transformations, and reports. And it’s one thing to enforce a policy — it’s another to be able to prove it, step by step, across your entire pipeline. Of course, having aligned permissions is great, but if something goes wrong, you’ll want to know exactly who made changes and when. That’s where the audit trail becomes just as critical as the permission model itself.


A Single Source of Truth for What Happened and When


Ever try to figure out who broke a dashboard — and end up stuck in a reply-all thread that keeps growing while no one actually answers the question? You bounce between the data team, the BI team, and sometimes even the storage admins, piecing together guesses. Meanwhile, the person who actually made the change is probably wondering why the metrics look “different” today. This is the part of analytics work where the technical problem turns into a game of office politics.

Audit logs are Fabric’s way of taking that noise out of the equation. They act like a black box recorder for your entire Lakehouse environment. Every significant action is captured: who did it, what they touched, and when it happened. It’s not just a generic access log—Fabric ties these entries directly to specific objects in the platform. So if a dataset’s schema changes, you can see the exact user account that made the change, along with a timestamp and the method they used.

Here’s where the connection to lineage makes a difference. If all you had was a folder of log files, you’d still end up manually cross-referencing IDs and timestamps to figure out the impact. But because Fabric already maps the data flow, those logs don’t live in isolation. You can view a dataset’s lineage, click on a node, and see precisely which actions were run against it. That means you can trace a broken metric right back to the transformation job it came from — and identify the person or process that ran it.

The coverage is broad, too. Fabric’s audit layer records access events, so you know when someone queried a table or opened a report. It logs creation and deletion of datasets, pipelines, and tables. Modifications get a record whether they’re structural, like changing a column type, or procedural, like editing a pipeline activity. Even publishing a new version of a Power BI report counts as an event, tied back to its lineage. All of it gets the same treatment: time, user, and object ID, stored in a consistent format.

This uniformity is what turns the logs into something usable for compliance. Regulatory audits don’t care about your internal tooling—they care that you can prove exactly who accessed sensitive data, under what authorizations, and what they did with it. Fabric’s audit trail can be queried to produce that history across ingestion, transformation, and output. If an HR dataset is classified as containing personal information, you can show not only the access list but every interaction that dataset had, right down to report exports.

Incident investigations work the same way. Say a number in a quarterly report doesn’t match the finance system. Instead of speculating, you go to the dataset feeding that report, pull its audit history, and see that two weeks ago a transformation step was added to a notebook. The person who committed that change is there in the log. You can verify if it was intentional, test the outcome, and fix the issue without having to untangle chains of hearsay.
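Once those events are exported, replaying an object's history is a simple filter. A minimal sketch, assuming the audit records have been exported as a JSON array (the file name and dataset name are hypothetical; the field names follow the unified audit log's common schema, where Fabric activity surfaces under the PowerBI workload):

    import json
    from collections import defaultdict

    # Load audit events exported from the audit search portal.
    with open("fabric_audit_export.json") as f:  # hypothetical export file
        events = json.load(f)

    # Group every Fabric event by the object it touched, so one object's
    # full modification history can be read in a single place.
    history = defaultdict(list)
    for e in events:
        if e.get("Workload") == "PowerBI":  # Fabric events use this workload
            history[e.get("ObjectId", "unknown")].append(
                (e["CreationTime"], e["UserId"], e["Operation"])
            )

    # Print the timeline for one dataset, oldest event first.
    for when, who, what in sorted(history["Quarterly Sales Dataset"]):  # hypothetical
        print(f"{when}  {who}  {what}")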
One of the underappreciated parts here is how it integrates with Purview. While Fabric keeps the logs, Purview can pull them in alongside the catalog and lineage data from across the organization. That means the audit for a dataset in one workspace can be looked at next to its related objects in other workspaces and even non-Fabric data sources. For large organizations, this stops investigations from stalling at the borders between teams. Everything’s indexed in a single, searchable layer.

When you link logs and lineage like this, you get more than a record of events—you get a timeline of your data’s actual life. You can follow the route from source to report, while also seeing who stepped in at each point. It’s a complete view that connects human actions to data flows. That’s what saves you from chasing people down in email threads or making decisions based on guesswork.

And beyond solving technical problems, this level of visibility takes the politics out of post-mortems. You’re not relying on memory or conflicting descriptions — you’ve got a single, objective record. No matter how complex the pipeline or how many teams touched it, you can back every claim with the same source of truth. And once that visibility is in place, the obvious next step is to scale it out, so that same clarity exists across every dataset and every team in the organization. That’s where the catalog comes in.


Purview: The Map Room for Your Data Universe


Knowing the lineage inside one workspace is useful — but it’s also like knowing the street map of your own neighborhood without ever seeing the city plan. You can navigate locally, but if the delivery truck gets lost two suburbs over, you have no idea why it’s late. That’s the gap between workspace-level insight and an enterprise-wide view. And that’s exactly where Microsoft Purview steps in.

Purview sits above Fabric, acting like an index for everything the platform knows about your data’s structure, movement, and classification. Instead of digging into each workspace separately, you get a single catalog that brings lineage, definitions, and access rules into one place. It aggregates metadata from multiple Fabric environments — and from outside sources too — so your view isn’t limited by team or project boundaries.

The problem it solves is straightforward but critical. Without a central catalog, each team’s view of lineage ends at their own assets. The BI group might know exactly how their dashboards are built from their datasets. The data engineering team might know how those datasets were sourced and transformed from raw data. But unless they’re trading notes constantly, the full picture never exists in one system. Troubleshooting, compliance checks, and data discovery all slow down because you have to stitch fragments together manually.

In Purview’s catalog, lineage from ingestion to analytics is mapped across every Fabric workspace it’s connected to. Imagine opening a dataset’s page and not only seeing its lineage inside its current workspace, but also the ingestion pipeline in another workspace that feeds it, and the curated table two more steps upstream. That’s not a separate diagram you have to maintain — it’s read directly from Fabric’s metadata and preserved in the catalog. From there, anyone with the right access can navigate it like a continuous chain, no matter which logical or organizational boundaries it crosses.

One of the most tangible benefits is search. Purview isn’t just indexing object names; it understands classifications and sensitivity labels. If your compliance officer wants to know where all data containing “customer phone number” is stored or consumed, they can run a query across the catalog and get every instance — in Lakehouse tables, Power BI datasets, even Synapse artifacts. That search works because Purview stores both the technical metadata and the business metadata you’ve added, so “customer phone number” could match a column in a Lakehouse table as well as a field in a report’s data model.
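That kind of catalog query can also be scripted. A minimal sketch against the Purview discovery endpoint, assuming an azure-identity login with catalog read rights; the account name and search term are placeholders, and the api-version and response fields may differ by account, so verify them against the current reference:

    import requests
    from azure.identity import DefaultAzureCredential

    account = "contoso-purview"  # placeholder account name
    endpoint = f"https://{account}.purview.azure.com/catalog/api/search/query"

    # Acquire a token scoped to Purview.
    token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

    # Search the catalog for every asset whose metadata matches the term,
    # including columns carrying a matching classification or label.
    response = requests.post(
        endpoint,
        params={"api-version": "2022-03-01-preview"},
        headers={"Authorization": f"Bearer {token}"},
        json={"keywords": "customer phone number", "limit": 25},
    )
    response.raise_for_status()

    for asset in response.json().get("value", []):
        print(asset.get("qualifiedName"), "-", asset.get("entityType"))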
That connection to business glossaries is where Purview goes beyond being a passive map. If you’ve defined common business terms, you can link them directly to datasets or columns in the catalog. It means that “Net Revenue” isn’t just a label in a report — it’s tied to the actual data source, transformation logic, and every report that uses it. For governance, this reduces ambiguity. Different teams aren’t debating definitions in chat threads; they’re all pointing to the same glossary entry, which links back to the exact data objects in Fabric.

Integration with technical assets is broad and consistent. Purview understands Power BI datasets, including their table and column structures. It knows Lakehouse tables and the pipelines feeding them. It registers Synapse notebooks, SQL scripts, and dataflow artifacts. And for each asset, it keeps track of lineage relationships and classifications. This makes it just as easy to trace the origin of a KPI in a Power BI report as it is to audit a transformation notebook’s impact on multiple downstream tables.

Centralizing all of this breaks down silos in a practical way. Without a single catalog, the security team might only see logs and permissions for their own systems, while the analytics team works in total isolation on reporting models. Purview creates overlap — the catalog becomes the single reference point for technical teams, analysts, and compliance officers alike. It means a governance policy written at the organizational level can be checked against real data flows, instead of relying on assumptions or self-reported documentation.

And that’s the point where technical reality meets compliance reporting. You’re not just drawing maps to satisfy curiosity. You’re connecting verified lineage to actual usage, classifications, and security rules in a way that can stand up to audits or investigations. Whether the question is “Where is this sensitive field stored?” or “Which reports depend on this table we’re changing?”, the answer is in the catalog — complete, current, and verifiable.
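To make that dependency question concrete: once lineage edges live in a catalog as metadata, answering “what sits downstream of this table?” is just a graph walk. This toy sketch is not a Purview API call (the asset names and edges are entirely hypothetical), but it shows the shape of the impact analysis the catalog enables:

    from collections import deque

    # Hypothetical lineage edges read out of a catalog: each asset maps to
    # the assets that consume it directly.
    downstream = {
        "lakehouse.sales_transactions": ["notebook.weekly_rollup"],
        "notebook.weekly_rollup": ["dataset.weekly_sales"],
        "dataset.weekly_sales": ["report.exec_dashboard", "report.regional_review"],
    }

    def impacted_assets(start: str) -> list[str]:
        """Walk the lineage graph breadth-first, collecting every asset
        that a change to `start` could affect."""
        seen, queue = set(), deque([start])
        while queue:
            for child in downstream.get(queue.popleft(), []):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return sorted(seen)

    # Everything that depends, directly or indirectly, on the source table:
    print(impacted_assets("lakehouse.sales_transactions"))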
With that kind of organization-wide visibility in place, you can finally see how every piece of the pipeline connects. Which raises the next challenge: ensuring that transparency isn’t lost once the data starts changing inside transformations.


Keeping Transparency Through Every Transformation


Every time data goes through a transformation, you’re removing or reshaping something. Maybe it’s a simple column rename, maybe a full aggregation — but either way, the original form changes. If the system isn’t capturing that moment, you’re left with a number you can’t properly account for. It still looks valid in a report, but ask how it was calculated and you’ll find yourself digging through scripts, emails, and memory to reconstruct what happened.

Inside Microsoft Fabric, this is where the Synapse transformation layer earns its keep. Whether you’re working in SQL scripts, Spark notebooks, or Dataflows, each step that changes the data keeps its connection back to the original source. The Lakehouse doesn’t just store the output table — it also knows exactly which datasets or tables fed into it, and how they link together. Those links become part of the lineage graph, so you can navigate both the “before” and the “after” without guessing or relying on separate documentation.

The risk without transformation-level lineage is pretty straightforward. You start trusting aggregates or calculated fields that may be outdated, incomplete, or based on incorrect joins. You can double-check the final query if you have it, but that tells you nothing about upstream filters or derived columns created three models earlier. This is how well-meaning teams can ship KPIs that contradict each other — each one consistent within its own context, but not rooted in the same underlying data path.

Here’s a simple scenario. You’ve got a transaction table logging individual sales: date, product, region, amount. The business asks for weekly sales totals by region. In a notebook, you group by week and sum the amounts, creating an aggregated table. In most systems, the link back to the base table isn’t tracked beyond the notebook script itself. In Fabric, that weekly sales table still appears in the lineage graph with a live connection to the source transaction table. When you click that node, you see where it came from, which transformation objects touched it, and where it’s used downstream in reports. That connection doesn’t fade after the job completes — it’s part of the metadata until you delete the asset.
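In notebook terms, that scenario is only a few lines. A minimal PySpark sketch, assuming hypothetical table names (sales_transactions and weekly_sales_by_region); because both tables live in the Lakehouse, the write is what shows up in the lineage graph as a downstream node of the source:

    from pyspark.sql import functions as F

    # In a Fabric notebook the `spark` session is already provided.
    transactions = spark.read.table("sales_transactions")  # hypothetical source

    # Aggregate individual sales into weekly totals per region.
    weekly = (
        transactions
        .withColumn("week", F.date_trunc("week", F.col("date")))
        .groupBy("week", "region")
        .agg(F.sum("amount").alias("weekly_sales"))
    )

    # Saving as a managed table keeps the output inside the Lakehouse,
    # where it is recorded as a downstream node of sales_transactions.
    weekly.write.mode("overwrite").saveAsTable("weekly_sales_by_region")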
On the graph, each transformation appears as its own node: a Dataflow, a Notebook, a SQL script. You can see both the incoming edges — the datasets it consumes — and the outgoing edges — the tables, views, or datasets it produces. This makes it obvious when multiple outputs come from the same transformation. For example, a cleansing script might produce a curated table for analytics and a separate feed for machine learning. The lineage view shows those two paths branching from the same point, so any changes to that transformation are visible to the owners of both outputs.

What’s useful is that this scope isn’t limited to one type of tool. A Dataflow transforming a CSV has the same kind of upstream and downstream tracking as a Spark notebook joining two Lakehouse tables. That consistency is possible because Fabric’s internal service mesh treats these tools as peers, passing metadata the same way it passes the actual data. The fact you built something in SQL and your colleague built theirs in a visual Dataflow doesn’t mean you need two different ways to see the lineage.

This automatic, tool-agnostic mapping turns an abstract governance goal into something you can actually act on. Quality assurance teams can audit an entire calculation path, not just the last step. Compliance officers can prove that a sensitive field was removed at a specific transformation stage and never reintroduced. Analysts can check if two KPIs share a common base table before deciding whether they truly compare like-for-like. It’s not about policing work — it’s about trusting outputs because you can see and verify every step that shaped them.

In a BI environment, trust is fragile. One unexplained spike or mismatch erodes confidence quickly. When you’ve got transformation-level lineage baked in, you can answer “Where did this number come from?” with more than a shrug. You can click your way from the report through each transformation, all the way back to the original record. And when that degree of traceability is combined with governance controls, permissions, and catalogs, the result isn’t just visibility — it’s an entire data estate where every decision and every metric can be backed by proof. That’s what ties all of these capabilities together into something more than the sum of their parts.


Conclusion


In Fabric, lineage, permissions, logging, and cataloging aren’t
extra features you bolt on later — they hold the Lakehouse
together. They work in the background, connecting every source,
transformation, and report with rules and proof you can actually
rely on. The more clearly you see your data’s actual journey, the more
confidently you can use it without creating risk. That’s the
difference between trusting a number because it “looks right” and
trusting it because you’ve verified every step. Tomorrow, pick
one of your data flows. Trace it start to finish. See what’s
recorded — and what that visibility could save you.


Get full access to M365 Show - Microsoft 365 Digital Workplace
Daily at m365.show/subscribe
