Platform to track and filter audited AI datasets launched by MIT, Cohere for AI, and other collaborators


MIT researchers, along with Cohere for AI and 11 other institutions, have launched the Data Provenance Platform to address data transparency challenges in the AI industry. The team conducted a comprehensive audit of nearly 2,000 widely used fine-tuning datasets, which together have been downloaded millions of times and underpin many NLP breakthroughs. The audit tags each dataset with its original data sources, re-licensings, creators, and other properties. To make this information accessible, the team developed the Data Provenance Explorer, an interactive platform that allows developers to track and filter datasets by legal and ethical criteria. It also enables scholars and journalists to explore the composition and data lineage of popular AI datasets.
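To picture the kind of filtering the Explorer enables, consider a minimal sketch in Python. Everything here is an illustrative assumption rather than the platform’s actual API: a hypothetical catalog.json holding per-dataset metadata, and an assumed allow-list of licenses taken to permit commercial use.

```python
# Minimal sketch: filtering a dataset catalog by license terms, in the
# spirit of what the Data Provenance Explorer enables. The file name,
# schema, and allow-list below are illustrative assumptions, not the
# platform's actual API or data format.
import json

# Assumed allow-list of licenses taken to permit commercial use.
COMMERCIAL_OK = {"MIT", "Apache-2.0", "CC-BY-4.0"}

def filter_datasets(catalog_path: str) -> list[dict]:
    """Return only the catalog entries whose license is on the allow-list."""
    with open(catalog_path) as f:
        catalog = json.load(f)  # expected: a list of {"name": ..., "license": ...}
    return [d for d in catalog if d.get("license") in COMMERCIAL_OK]

if __name__ == "__main__":
    for d in filter_datasets("catalog.json"):
        print(d["name"], "-", d["license"])
```

In practice a developer would filter on more than license strings, such as source domains, creator attribution, or downstream terms of use, but the shape of the operation is the same: metadata in, a legally safer subset out.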

In addition, the group has released a paper titled “The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.” The paper highlights the risks of treating dataset collections as monolithic entities while neglecting their lineage through multiple rounds of repackaging and re-licensing. This lack of understanding can lead to data leakage, unintended biases, and lower-quality models. The paper also emphasizes the ethical and legal risks of incomplete documentation and of model releases that contradict the terms of use attached to their training data.
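The lineage problem is easy to make concrete with a small data-structure sketch. The class and function names below (ProvenanceRecord, effective_licenses) are hypothetical, not from the paper’s tooling; the point is that a dataset repackaged under a permissive license can still carry restrictive terms from an upstream source.

```python
# Minimal sketch of dataset lineage, assuming each repackaging step may
# attach its own license. Names here are hypothetical illustrations,
# not part of the Data Provenance Initiative's tooling.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    name: str
    license: str
    derived_from: list["ProvenanceRecord"] = field(default_factory=list)

def effective_licenses(record: ProvenanceRecord) -> set[str]:
    """Collect every license in a dataset's ancestry. A consumer is
    potentially bound by all of them, not just the latest repackaging."""
    licenses = {record.license}
    for parent in record.derived_from:
        licenses |= effective_licenses(parent)
    return licenses

# A collection re-released under CC-BY still carries the non-commercial
# terms of the scraped source it was derived from.
source = ProvenanceRecord("forum_scrape", "CC-BY-NC-4.0")
repack = ProvenanceRecord("instruct_mix_v2", "CC-BY-4.0", [source])
print(effective_licenses(repack))  # {'CC-BY-NC-4.0', 'CC-BY-4.0'}
```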

Training datasets have long been under scrutiny, and VentureBeat has covered the issues surrounding data provenance and transparency extensively. In one instance, Lightning AI CEO William Falcon criticized OpenAI’s GPT-4 paper for its lack of transparency about architecture, dataset construction, and training methods. A separate deep dive into copyright issues in generative AI training data examined the challenges raised by training large language and diffusion models on copyrighted content used without consent.

The launch of the Data Provenance Platform and the comprehensive dataset audit mark a significant step toward addressing the data transparency crisis in the AI space. By giving developers, scholars, and journalists access to information about dataset sources and properties, the initiative helps them make more informed decisions and meet legal and ethical obligations.