When I spoke to the incomparable Kristina Podnar last November, she was – in a word – ecstatic.
And rightly so: the Data Provenance Standards, an initiative she helped guide into existence as Senior Policy Director at the Data & Trust Alliance (D&TA), were finally emerging from the chrysalis and taking flight.
The D&TA has become a leading voice in the chorus for trustworthy data and AI. Founded in 2020 by market mavens Ken Chenault and Sam Palmisano, the organization has been keenly focused on guiding the responsible use of data and intelligent systems.
If you’re not familiar with the Data Provenance Standards, I encourage you to read my previous interview with Kristina, who is an oracle on all things Digital Policy.
As we discussed in our conversation, having clear data provenance – understanding the origin, lineage, and rights associated with data – is vital to establishing trust in the details and decisions coming from data-enabled systems. This includes AI applications, which are exploding in popularity and expanding across numerous industry use cases.
The Data Provenance Standards were co-created by 19 member organizations of the Data & Trust Alliance, along with multiple third-party industry partners and ecosystem experts. Contributing players include American Express, Humana, IBM, Mastercard, Pfizer, and Walmart (to name a few) – obviously some of the biggest and most notable brands in the world.
Now, after testing and validation with more than 50 organizations both inside and outside the Data & Trust Alliance, the organization just released version 1.0.0 of its Data Provenance Standards. The announcement was detailed in a blog post by the brilliant Saira Jesani, Executive Director of D&TA – who I’ve also spoken to previously about this ambitious initiative.
As Saira noted in her overview, the entire journey started because the D&TA members and the broader business community were yearning for, as she called it, “better rules for the road around data quality.” She went on to frame how, with the race to adopt AI exacerbating the challenge, data has become the most sustainable source of competitive advantage. This realization ultimately led to the creation of the Data Provenance Standards, and now to the vital feedback loop that will continue to improve the framework.
When first proposed back in November of 2023, Kristina was quick to point out that the new Data Provenance Standards were not a form of policy, but benchmarks accompanied by rich metadata. Functionally, this metadata serves as a set of considerations for assessing trustworthiness when validating the integrity of datasets.
At launch, eight proposed standards were included, each with a unique provenance metadata ID and a label for tracking the history and origin of each dataset. It’s worth noting that the standards apply at the dataset level, acting as an overall “fingerprint” for divining details like the data’s lineage, source, legal rights, and more.
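To make the idea of a dataset-level “fingerprint” concrete, here is a minimal sketch of what such a provenance record might look like in practice. The field names and structure below are my own illustration of the kinds of metadata the standards describe (lineage, source, legal rights), not the official D&TA schema.

```python
# Illustrative sketch only: field names are my own shorthand, not the official
# Data Provenance Standards metadata schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetProvenanceRecord:
    """A hypothetical dataset-level 'fingerprint' carrying provenance metadata."""
    metadata_id: str                        # unique provenance metadata ID for the dataset
    dataset_title: str
    source: str                             # originating organization or system
    origin_geography: str                   # where the data was generated or collected
    creation_method: str                    # e.g., first-party logs, web crawl, synthetic
    collection_date_range: tuple[str, str]  # start and end dates of data generation
    license_terms: Optional[str] = None     # legal rights governing use
    consent_documented: bool = False        # whether user consent language was captured
    lineage_notes: list[str] = field(default_factory=list)  # upstream transformations

# Example record for a fictional dataset
record = DatasetProvenanceRecord(
    metadata_id="dps-0001-example",
    dataset_title="Retail transactions sample (fictional)",
    source="Example Corp point-of-sale exports",
    origin_geography="US",
    creation_method="first-party transaction logs",
    collection_date_range=("2023-01-01", "2023-12-31"),
    license_terms="internal use only",
    consent_documented=True,
    lineage_notes=["de-duplicated", "PII fields pseudonymized"],
)
```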
According to Saira, the standards were designed to facilitate business adoption via two key focus areas. The first is to demonstrate tangible business value by helping organizations assess which datasets are of higher quality. This includes delivering the transparency required for efficient, accurate, and reliable use of data – as well as mitigating the risks from legal, copyright, and regulatory compliance issues.
The second is making the implementation of the standards more streamlined and attainable, thereby encouraging adoption. This is achieved by homing in on the most essential metadata needed to reveal and understand a dataset’s origin, method of creation, and whether it can be legally used.
Since last November, we’ve seen significant changes across the regulatory landscape regarding data and AI, chiefly marked by Europe’s sweeping EU AI Act. As Saira noted, while regulation is advancing at the macro level, we still lack standard definitions for the critical elements involved.
To call this a conundrum would be selling it short, to say nothing of the challenges surrounding the quality and integrity of datasets – which are essential for AI model training to deliver accurate and trustworthy results.
“Model evaluation is difficult, with little transparency around the data that trains and feeds those models,” she stated. “The consequences – from copyright infringement to privacy to authenticity – could affect both the technology’s business value and its acceptance by society, limiting organizations’ ability to determine what is to be trusted.”
There’s a lot at stake, and that’s why the creation of the Data Provenance Standards is an evolutionary process – one in which the D&TA is steadfast in its commitment to engaging with stakeholders, reviewing feedback, and modifying its approach to improve outcomes and further encourage adoption.
Since the Data Provenance Standards were first proposed, Saira and Kristina have been working diligently with the D&TA’s ecosystem of data, AI, ethics, and legal experts to enhance business adoption, gauge feedback, and continuously test and validate the standards. This has led to version 1.0.0, which includes multiple refinements.
First and foremost, the original eight standards have been simplified down to three, streamlining adoption and implementation. The accompanying metadata has also been revised, focusing on detail that “shows, not tells” – and giving data, AI, and legal teams stronger evidence to help inform critical decisions.
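As a rough illustration of how teams might put that evidence to work, here is a simple completeness check over the hypothetical record sketched earlier. The three groupings are my own shorthand for the areas the article describes – where the data came from, how it was created, and whether it can legally be used – not the official names of the three standards.

```python
# Illustration only: a naive completeness check over the hypothetical
# DatasetProvenanceRecord above, grouped into three informal areas.
def check_provenance_completeness(rec: DatasetProvenanceRecord) -> dict[str, bool]:
    return {
        "origin_documented": bool(rec.source and rec.origin_geography),
        "creation_method_documented": bool(rec.creation_method and rec.collection_date_range),
        "legal_use_documented": rec.license_terms is not None and rec.consent_documented,
    }

print(check_provenance_completeness(record))
# {'origin_documented': True, 'creation_method_documented': True, 'legal_use_documented': True}
```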
There’s also been a concerted effort to make D&TA queries more specific about Privacy Enhancing Technologies (PETs) to elicit more accurate information. PETs are a broad set of tools and practices that enable organizations to build products and functionality while protecting the privacy of users’ data. Demonstrating how PETs fit into the equation further underscores D&TA’s commitment to transparency.
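PETs span a wide range of techniques, from differential privacy to secure computation. As one small, purely illustrative example (not a technique prescribed by the standards), field-level pseudonymization replaces a direct identifier with a keyed hash before a dataset is shared, so downstream consumers can join on the token without ever seeing the raw value.

```python
# Illustration of one PET-style practice: pseudonymizing a direct identifier
# with a keyed hash (HMAC-SHA256) before sharing the dataset.
import hmac
import hashlib

SECRET_KEY = b"store-and-rotate-this-in-a-secrets-manager"  # placeholder key

def pseudonymize(value: str) -> str:
    """Return a stable token for a sensitive value; the raw value never leaves the source system."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymize("jane.doe@example.com"))
```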
Along those same lines, v1.0.0 also sports refinements around surfacing consent language – in particular, the language shown to a user when their personal data is collected – which helps assess the risks associated with consumer data usage.
Here’s one of the biggest outcomes thus far: Thanks to persistent testing and validation over several months, D&TA has assembled a number of case studies, including one that has been released by IBM. This example illustrates how the Data Provenance Standards increased overall data quality and reduced clearance review time for datasets used to train AI models.
The Data & Trust Alliance is shepherding a flock of forward-thinking organizations that recognize the challenges – and opportunities – associated with data quality and integrity.
Market-leading brands form the vanguard of early adopters of the Data Provenance Standards, and their participation is also an incentive for their broader ecosystems to join the fold, if for no other reason than to do business with the Pfizers and Walmarts of the world.
Leading by example is perhaps an obvious force multiplier, but as Saira observed in her post, the D&TA’s goal is to increase transparency across the broader business scope – and encourage all organizations to take advantage of this free tool and realize the benefits of adoption.
Accelerating and expanding that adoption is next on the docket. While numerous data suppliers and producers of all sizes participated in the feedback cycle of the Data Provenance Standards, the organization is now focused on fully enlisting them as adoption partners and on easing implementation by working with toolset providers.
The D&TA’s innovative work is manifesting quantifiable benefits. Since the launch of the proposed Data Provenance Standards, the business value has become more palpable – particularly around the impact on productivity. As it stands, data scientists spend nearly 40% of their time on data preparation and cleansing tasks, which is a huge drag on productivity. On top of that, 61% of CEOs cite the lack of clarity around data lineage and provenance as a central risk preventing the adoption of AI across their businesses.
Refinement is certainly a predictable expectation when launching a new concept like this. But already, the proof is in the provenance pudding – and the IBM case study is a compelling example of how these standards translate into real business value. Reducing friction and lowering the barriers to testing AI models can accelerate time to market – all stemming from greater trust and confidence in datasets documented against these standards.
In my previous conversations with Kristina (including a chat at the Boye & Company CMS Kickoff event in January), we discussed the potential impact of the standards on the content, ad tech, and marketing industries. All of these sectors rely heavily on data, and AI has now permeated every CMS and martech platform on the planet, both directly and indirectly. This cross-section is pivotal, as brands will increasingly rely on more trustworthy data to support their AI initiatives.
The content management ecosystem is at the foundation of how these technologies and campaigns will be managed. As such, the impact of the Data Provenance Standards could bring greater control and confidence to this sector of the market landscape.
In an economy where data is the most valuable currency – and every business across industries is now (or soon will be) a data enterprise – trust is paramount. As organizations race to leverage the potential of AI, good data is absolutely essential. This takes on a heightened imperative when we consider the kinds of initiatives that AI is supporting, like cancer or diabetes research.
This is precisely why the Data Provenance Standards are emerging as a key foundation for ensuring the future of data, AI, and – most importantly – people.