If it feels like you’re drowning in AI, you’re not alone.
ChatGPT might only be a year old (and acting a bit like a toddler), but the wave of innovation it spurred is dominating every conversation. To complicate matters, the ecosystem dynamics are evolving on a daily… er, hourly basis.
One possible life raft on these stormy seas has emerged: data provenance. But before we drop anchor on that, let’s review the last few weeks of “AI headwinds,” shall we? And try to keep your head above water…
Meanwhile, AWS opened the proverbial floodgates at re:Invent, the cloud juggernaut’s annual confab in Las Vegas. AI felt like the only show on the strip, with Amazon introducing a slew of new models, services, and its hyper-connective Amazon Q assistant.
Finally, not to be bested by its cloud rivals, Google just unveiled its highly anticipated Gemini AI last week, an advanced LLM built for multimodal reasoning across text, images, video, and code.
And that’s just scratching the surface.
Floodgates… drowning… is there a lifejacket on this boat?
Despite the EU’s landmark agreement to limit the use of artificial intelligence, we can all agree that the genie isn’t going back in the bottle. And with businesses set to increase their AI investments by almost 70% next year, the pros currently outweigh the cons.
Still, there are challenges aplenty. As more organizations dive into AI, big questions linger around accuracy, security, and governance. Between rampant hallucinations, unsavory biases, and other “shenanigans,” organizations are wrestling with existential issues as they lean more heavily on AI to crunch their data.
It’s worth pointing out that machine learning technologies have been around for quite some time, and data has always been the lifeblood. The recent AI acceleration is due in part to the advent of new LLMs, but also the growth of compute capabilities and cloud services to manage more of the heavy lifting. Now, enterprises can rapidly build and train their own AI models and leverage more datasets for critical use cases – like predicting diabetes based on procedure data from diagnoses and medical claims.
But what about the trustworthiness of the data they’re feeding to AI? Do they know where, when, and how it was collected? If they rely on third-party entities, how do they ensure the data they inherit is clean and complete?
This is one of the questions the Data & Trust Alliance is endeavoring to answer. The not-for-profit consortium is comprised of Fortune 500 businesses across multiple industries (IBM, Pfizer, Nielsen, Walmart, Meta, just to name a few) – all focused on learning, developing, and adopting responsible data and AI governance.
I recently spoke with Kristina Podnar, Senior Policy Director for the organization, about the Alliance’s proposed Data Provenance Standards. Kristina is an industry authority and one of the sharpest, most passionate voices on the subject of Digital Policy. She walked me through the proposed standards and the elegant role of metadata in delivering this essential layer of transparency.
A bit of history for context: The Data & Trust Alliance was formed in 2020 by two heavy hitters: Ken Chenault, General Catalyst chairman and managing director, and former American Express chairman and CEO; and Sam Palmisano, former CEO of IBM.
In addition to helming some of the world’s largest brands, these pioneering luminaries have served on numerous boards – including Airbnb, NCAA, Berkshire Hathaway – and continue to be influential figures across industries.
As Kristina recalled, both Chenault and Palmisano observed that the next era of business won’t depend merely on data and AI, but on trustworthy data and AI. To help guide the responsible use of data and intelligent systems, they formed the Data & Trust Alliance – which resides within the Center for Global Enterprise (CGE), a New York-based non-profit.
The Alliance is built on a core belief that every enterprise in every industry is (or will soon be) a data enterprise, and must be a responsible steward of its data, algorithms, and AI. To support this mission, the organization doesn’t lobby or advocate on public policy issues; instead, it maintains a fierce commitment to developing tools and sharing practices to enhance learning.
“Unlike other coalitions that publish a lot of guiding principles, the Data & Trust Alliance is all about actually creating tools, which member companies can adopt,” she said. “In the coalition are folks like American Express, Deloitte, Nike, Starbucks, AARP – the list goes on. These organizations get together around a table and ask, what are the biggest challenges we're facing right now? And then let's actually create a solution and adopt it.”
According to Kristina, there have been several initiatives that the Alliance has taken on since its inception. This includes a data assessment resource for mergers and acquisitions, and a toolkit for preventing or detecting bias in HR hiring practices when using AI. The latter predates NYC’s Automated Employment Decision Tool law – the first of its kind in the country, and a potential indicator of future legislation (the bill has also garnered criticism from a range of entities).
According to the Data & Trust Alliance, having clear data provenance – understanding the origin, lineage, and rights associated with data – is vital to establishing trust around the insights and decisions coming from data-enabled systems. This includes AI applications.
As it stands, data scientists spend nearly 40% of their time on data preparation and cleansing tasks, which is a huge drag on productivity. Further, 61% of CEOs cite a lack of clarity around data lineage and provenance as a central risk preventing the adoption of AI across their businesses.
All these factors led to the development of the Data Provenance Standards, which were co-created by 19 Data & Trust Alliance member organizations, along with multiple third-party industry partners and ecosystem experts.
The organization took a “for industry, by industry” approach when developing the proposed standards, identifying provenance pain points from 25 use cases across its diverse members. The resulting set focuses on the eight most essential, valuable, and feasible standards for small and large enterprises to implement – irrespective of industry.
“The interesting thing about these data provenance standards is they’re built on top of AI use cases,” Kristina shared. “But in reality, they apply to any data, including traditional data. It's the first set of standards really targeted at trustworthiness.”
Unlike other principles and industry guidelines, where compliance is voluntary, the Data & Trust Alliance’s provenance standards were designed to be actionable, specifically for data governance practitioners. When implemented, they will provide transparency into the origin of the datasets used for both traditional data applications and a rapidly growing number of AI applications – which will ultimately enhance AI value and trustworthiness.
The word policy might give you the shivers, but that’s where the Data Provenance Standards provide an elegant solution: metadata.
“First of all, it's not policy,” Kristina explained. “They are a set of standards, and it's rich metadata that goes along with that. They extend the concept of metadata into a set of trustworthy considerations. At the root of all eight standards, there's a provenance metadata unique ID, and that unique label tracks the history and origin for each data set.”
As she further pointed out, the standards apply at the dataset level – not at the element, row, or table level – acting as an overall “fingerprint” for divining the lineage of the dataset.
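To make the idea concrete, here’s a minimal sketch of what dataset-level provenance metadata could look like. The field names and values are hypothetical illustrations for this article – not the Alliance’s actual schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of dataset-level provenance metadata.
# Field names are illustrative, not the Alliance's published standard.
@dataclass
class DatasetProvenance:
    provenance_id: str      # unique ID acting as the dataset's "fingerprint"
    source: str             # where the data originated
    lineage: List[str]      # upstream datasets this one was derived from
    generation_date: str    # when the dataset was produced (ISO 8601)
    intended_use: str       # purposes the data may be used for
    restrictions: List[str] # uses or entities the data may not be shared with

# Example: a third-party claims dataset with its provenance attached
claims_data = DatasetProvenance(
    provenance_id="dtp-2023-000123",
    source="Example Health Claims Aggregator",
    lineage=["dtp-2023-000045", "dtp-2023-000087"],
    generation_date="2023-11-01",
    intended_use="de-identified medical research",
    restrictions=["no resale", "no sharing with government agencies"],
)
```

Because the metadata travels at the dataset level, a consumer can inspect one record like this – rather than auditing individual rows – before deciding whether to trust or acquire the data.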
The eight proposed Data Provenance Standards surface metadata and associated values for:
To see example values, you can download the Standards information pack from the Data & Trust Alliance website.
Providing information about the origin and rights associated with datasets gives enterprises more confidence in the data they source and use. This can have significant benefits, like improving operational efficiency, regulatory compliance, collaboration, and value generation.
“If you think about the biggest challenge that corporations or any entity face today, everybody will tell you it’s data quality,” Kristina interjected. “We were originally asked to develop a set of standards around data quality, but quality is very subjective. With our Data Provenance Standards, what we can all agree on is where data originated, and the attributes that it actually has.”
To illustrate the impact of the Data Provenance Standards, Kristina narrates a short video that outlines a hypothetical use case in the healthcare industry. The story focuses on a researcher at a fictitious company that is using predictive AI modeling to study Hepatitis C in young women. The researcher acquires data regularly from third-party sources and needs to ensure that the dataset is reliable.
Using the Data Provenance metadata, the researcher can inform her decision-making with a number of key variables. By examining the lineage standard, she sees that the current dataset is derived from multiple sources. The generation date indicates how fresh the data is and how much cleaning might be involved. And with the intended use and restrictions values, she can identify blockers for sharing the data with specific entities, such as the federal government.
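The kind of screening the researcher performs could be sketched in a few lines of code. Again, the field names and the freshness threshold here are illustrative assumptions, not part of the proposed standards:

```python
from datetime import date

# Purely illustrative screening of dataset provenance metadata before use.
# The keys ("generation_date", "lineage", "restrictions") are assumptions
# for this sketch, not the Alliance's published schema.
def screen_dataset(meta: dict, as_of: date, max_age_days: int = 365) -> list:
    issues = []

    # Freshness: an older generation date hints at more cleaning work ahead
    age_days = (as_of - date.fromisoformat(meta["generation_date"])).days
    if age_days > max_age_days:
        issues.append(f"dataset is {age_days} days old")

    # Lineage: a dataset derived from multiple sources warrants extra scrutiny
    if len(meta["lineage"]) > 1:
        issues.append("derived from multiple upstream sources")

    # Restrictions: surface blockers for sharing with specific entities
    for restriction in meta["restrictions"]:
        if "government" in restriction:
            issues.append("sharing with government entities is restricted")

    return issues

# Example: screening a third-party claims dataset acquired for research
report = screen_dataset(
    {
        "generation_date": "2023-11-01",
        "lineage": ["dtp-2023-000045", "dtp-2023-000087"],
        "restrictions": ["no resale", "no sharing with government agencies"],
    },
    as_of=date(2024, 12, 1),
)
```

The point of the sketch: once provenance travels with the dataset as structured metadata, checks like these become routine code rather than manual detective work.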
It's a fantastic video and use case. I’ll let you watch it to hear the researcher’s verdict, but the point is clear: armed with this metadata, enterprises can rely on essential information and attributes to make a decision with greater confidence – and ultimately improve outcomes.
It’s easy to see how the proposed Data Provenance Standards will impact the digital experience ecosystem. First and foremost: DX applications rely on data. The list is long, but for starters, think about acquiring outside datasets to power digital advertising and marketing initiatives.
“It will be interesting for third-party data procurement purposes,” said Kristina. “But I think in due course, it’s going to start to impact adtech and martech. It can't help but impact it.”
AI is also being woven across every layer of the digital stack, so enhancing transparency will be key to trust at every level – particularly in a decoupled, API-first world that taps lots of data sources. This includes CMS and DXP platforms that are becoming training grounds for AI and connecting with a myriad of external services driven by data.
The new standards are still in a “proposed” mode as the Data & Trust Alliance continues to gather feedback from the market. You can participate in the process by reviewing the standards and providing input on the metadata and values via an online survey.
As a resident Digital Policy rockstar, Kristina seems bullish on the potential impact – namely because the Data & Trust Alliance represents leading global brands that are committed to ensuring this vital transparency.
“The reason why this matters is because around the table are the largest procurement entities for data in the world,” Kristina said emphatically. “So if they start to adopt these standards and apply them to their data in the form of metadata, what you actually have is a huge ecosystem play.”
That ecosystem won't stop the AI floodgates from opening further, but it might help keep things afloat.
CMS Critic is a proud conference partner at this second annual edition of the prestigious international Boye & Co conference, dedicated to the global CMS community. This event will bring together top-notch speakers, Boye & Company's renowned learning format, and engaging social events.
Tired of impersonal and overwhelming gatherings? Picture this event as a unique blend of masterclasses, insightful talks, interactive discussions, impactful learning sessions, and authentic networking opportunities. Prepare for an unparalleled in-person CMS conference experience that will equip you to move forward in 2024.
Along with Kristina Podnar, hear from leading voices across the industry on a wide range of topics, including:
Is there a better location for a winter kickoff than Florida's beautiful, sugary sand beaches? CMS Kickoff 2024 will be held at the iconic Don CeSar, just steps from the Gulf of Mexico. Dubbed the “Pink Palace,” this majestic hotel and resort provides a stunning backdrop to the conference, along with access to the local food and culture of St. Petersburg Beach.
CMS Kickoff offers an intimate, highly focused experience. Space is limited, and only a few seats remain. Don't miss this exclusive opportunity!