August may be perceived as the month when France shuts down for the summer. Yet, just before the summer ’23 holiday, the French Data Protection Authority (“CNIL”) published several calls to action for players in the data ecosystem in general and in artificial intelligence (AI) in particular, following its 16 May 2023 announcement of an AI action plan:
- Opening and re-use of publicly accessible data – The CNIL published draft guidance on the use of such data, and all stakeholders are invited to weigh in until 15 October 2023 before its finalization. While non-binding, this guidance is expected to lead the way on how EU Supervisory Authorities will apprehend and enforce the General Data Protection Regulation (“GDPR”) when personal data is scraped from online sources and then re-used for further purposes. The draft notably focuses on Art. 14 GDPR, the indirect collection of personal data, and the specific prior-information requirements. Artificial intelligence is explicitly mentioned by the CNIL in the draft, as such data, which feeds large language models, “undeniably contributes to the development of the digital economy and is at the core of artificial intelligence.” Stakeholders are invited to submit their observations online through the dedicated portal.
- Artificial Intelligence Sandbox – Following in the footsteps of its connected cameras, EdTech and eHealth initiatives, the CNIL is launching an AI sandbox call for projects, under which stakeholders involved in AI in connection with public services may apply to receive dedicated assistance from the regulator to co-construct AI systems that comply with data protection and privacy rules.
- Creation of databases for Artificial Intelligence uses – Open to the broadest possible array of stakeholders (including individuals), this call for contributions notably addresses the specific issue of the use of publicly accessible data and aims to inform the CNIL of the various positions at play and of how to balance the GDPR’s requirements (information, legitimate interests, exercise of rights) with data subjects’ expectations. Stakeholders are invited to submit their observations online through the dedicated form (in French – our free translation in English is available below). No deadline for submission has been set.
French Data Protection Authority – Work on database creation for Artificial Intelligence
(Travaux sur la constitution de bases de données pour l’intelligence artificielle – CNIL)
Published on 27 July 2023 – Free Translation
Introduction
To support its work, and in order to benefit from the practical and operational expertise of AI players, the CNIL would like to gather contributions from all stakeholders on several points that structure its analysis.
How to contribute?
The CNIL invites you to write your responses in the form of concrete examples that will be particularly useful in supporting its future recommendations.
Contributions can be sent to ia@cnil.fr.
It is not necessary to answer all the questions raised in the questionnaire. In fact, we recommend that you provide targeted contributions on those questions below for which you have specific legal, technical or operational expertise.
What information will be collected?
Contributors are invited to introduce themselves in order to contextualize their contributions (name and category of organization: researcher, association representing civil society, public administration, private company, developer or user of AI systems, etc.).
The CNIL processes the data collected in this way in order to analyze the contributors’ observations with a view to adopting a position on the subjects concerned. Data is also collected to produce statistics relating to contributions and, if necessary, to contact contributors in order to pursue the discussion or keep them informed of the consultation’s follow-up.
You can:
- access your data;
- object to the processing of your data;
- request rectification or deletion;
- exercise your right to restrict the processing of your data.
Find out more about data management and your rights.
Please note: in your contribution, indicate any elements protected by literary or artistic property rights (in that case, specify whether or not you authorize their disclosure) or by trade secrets.
All contributions received by the CNIL may be the subject of an access request as administrative documents under the French Code des relations entre le public et l’administration. However, the CNIL is not obliged to follow your assessment of what is or is not protected.
Information about your organization
Organization name:
Category of organization (researcher, association representing civil society, public administration, private company, developer or user of AI systems, etc.):
Ensure that the purpose is specific, explicit and legitimate
Any processing of personal data must pursue a specific, explicit and legitimate purpose (or objective), made known to the persons concerned. The case of general-purpose AI and foundation models raises questions about the purpose of these generic systems.
The CNIL is interested in receiving contributions on these questions:
- How do you define the purpose of processing aimed at training an AI model when the intended operational use is not unique and precisely identified (e.g. in the case of generative AI development or general-purpose AI systems)?
- Do you think it’s relevant to refer to the model’s capability or task (e.g. facial recognition, object detection and classification, segmentation/partitioning or clustering)?
- Do you think it’s relevant to refer to the model’s intended or possible reuse (e.g. commercial exploitation, scientific research, open or closed source distribution)?
- Do you think it’s relevant to refer to known or conceivable use cases for the AI system (medical diagnostics, autonomous driving, etc.)?
- In which cases do you feel that the AI system’s design has a scientific research purpose? If the AI system developed for scientific purposes is also marketed, how can we distinguish in law and in practice between research and commercial purposes?
- In what cases and under which conditions does the AI system’s design seem to you to have a statistical purpose?
Selection and minimization of data
Any processing of personal data must comply with the principle of data minimization. In addition, the creation of AI databases always involves data selection and filtering, to guarantee the performance of the models trained (e.g. deduplication) or to avoid processing particularly risky data (e.g. credit card numbers).
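By way of illustration only, the snippet below sketches the two filtering steps the CNIL cites, deduplication and the removal of particularly risky data such as credit card numbers, on a toy list of text records. The pattern and field handling are simplified assumptions for this sketch, not a CNIL-endorsed method:

```python
import re

# Simplified pattern for likely credit card numbers: 13-19 digits,
# optionally separated by spaces or dashes (illustrative only).
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def filter_records(records: list[str]) -> list[str]:
    """Drop exact duplicates, then redact likely credit card numbers."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for record in records:
        if record in seen:  # deduplication, e.g. to avoid over-weighting repeated text
            continue
        seen.add(record)
        cleaned.append(CARD_PATTERN.sub("[REDACTED]", record))
    return cleaned

print(filter_records([
    "order 4111 1111 1111 1111 confirmed",
    "order 4111 1111 1111 1111 confirmed",  # exact duplicate: dropped
    "delivery note without sensitive data",
]))
```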
The CNIL welcomes contributions on state-of-the-art practices and minimization measures:
- For a given example, what data selection methods are in place?
- Prior to collection (scope, filters, etc.)?
- At the time of collection (by excluding irrelevant data, for example)?
- After collection (through anonymization and pseudonymization measures in particular)?
- What are the constraints on this collection phase, and what risks have been identified?
- In your opinion, what are the best practice measures to be implemented concerning:
- Checking and improving data quality (random review of data and annotations, cross-validation procedures for annotations, automated technical processes, etc.)?
- Measuring and improving data representativeness (regarding the diversity of situations, people and conditions of use of the AI system, using statistical tools, tools for building representative data subsets for training, etc.)? (See the sampling sketch after this list.)
- Validation of the most suitable method for the task at hand (comparison between learning-based and non-learning-based systems, in-house or outsourced development such as the use of pre-trained models, including weighing this against the cost of data collection and compliance, etc.)?
- Selection of the categories and volume of data required for learning (comparison of performance obtained by removing certain variables or reducing the volume of data used, principal component analysis, etc.)?
- Data collection techniques (harvesting, scraping, using an API, downloading a file, etc.)?
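As a purely illustrative take on the representativeness question above, the sketch below draws a stratified training subset that preserves the distribution of one attribute; the column names and proportions are hypothetical:

```python
import pandas as pd

# Toy corpus with an imbalanced attribute (hypothetical column names).
df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(1000)],
    "region": ["urban"] * 700 + ["rural"] * 300,
})

# Stratified 10% sample: each region keeps its share in the training subset.
subset = df.groupby("region").sample(frac=0.10, random_state=0)
print(subset["region"].value_counts(normalize=True))  # ~0.7 urban / ~0.3 rural
```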
Adopt an approach that respects the principle of data protection by design and by default
The CNIL wishes to identify the technical, contractual and organizational resources needed to implement the data processing in compliance with the principle of data protection by design and by default.
At this stage, the CNIL has identified the following techniques and measures:
- Synthetic data.
- Federated learning.
- Secure multi-party computation.
- Anonymization or pseudonymization (through differential privacy, for example; see the sketch after this list).
- Use of a data intermediary or trusted third party to implement protective measures such as anonymization or the exercise of rights.
- Licensing for reuse of AI datasets and models.
- Secure execution environments.
- Certain machine unlearning techniques.
- Homomorphic encryption for training.
- Setting up an ethics committee.
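To make the differential privacy item above concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query; the epsilon value and the query itself are illustrative assumptions, not a recommended configuration:

```python
import numpy as np

def dp_count(values: list[bool], epsilon: float, rng: np.random.Generator) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by
    at most 1), so Laplace noise of scale 1/epsilon suffices.
    """
    return sum(values) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(seed=0)
opted_in = [True] * 420 + [False] * 580  # toy dataset
print(dp_count(opted_in, epsilon=0.5, rng=rng))  # noisy value near 420
```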
The CNIL would like to receive contributions on operational implementations of this type of measure, based on concrete examples:
- For what applications and on what types of data have these techniques been applied?
- What are the conditions for successful integration of these techniques (technical environment, governance, etc.)?
Legitimate interest: balancing rights and interests
In the event that the processing involved in setting up the AI database or in configuring (“training”) the AI model is based on the legitimate interests of the data controller, it is important to ensure that the processing does not infringe the rights and interests of the people whose data is being processed.
The CNIL would welcome contributions on the various steps involved in assessing this point:
- What are the consequences that processing aimed at building a database for AI and/or model training may have on the data subjects?
- In your opinion, does such processing constitute an invasion of privacy? In what cases?
- In your opinion, can such processing have an impact on other fundamental rights (freedom of expression, freedom of information, freedom of conscience, etc.)?
- In your opinion, can this processing have other concrete impacts (services accessible to the user, exclusion of certain rights), in particular due to the automation and scale of information collection?
- In your opinion, what are the reasonable expectations of data subjects with regard to these processing operations and the data used to train the models?
- In which cases and for which categories of data do you think it is possible to consider that the constitution and use of the database for AI training falls within the reasonable expectations of individuals (particularly for databases composed of freely accessible data)?
- What limitations on the AI’s purposes might correspond to people’s reasonable expectations (for example, the fact that the AI is intended only for research)?
- What are the relevant compensatory or additional measures that could limit the impact of the processing on data subjects?
- In terms of informing individuals (e.g. data traceability mechanisms in the event of re-use of pre-constituted datasets)?
- With regard to the exercise of rights (centralization or transmission of requests to exercise rights in the case of multiple players, mechanisms to facilitate the exercise of a right of access, for example by means of a search engine, etc.)?
- Data management (filtering, selection, anonymization, etc.)?
- Service design (e.g., adding modules to control model inputs and/or outputs)?
- Do you feel that publishing the AI model as open source contributes to a better balance between the rights and interests involved? If so, to what extent and in which cases?
- In your opinion, in what cases could the exercise of individual rights be excluded (in particular the right to object, but also more broadly the rights to rectification, erasure, access, etc.)?
- Do you have any examples of publicly accessible datasets:
- To which you would like to draw the CNIL’s attention when drawing up its doctrine, because of the compliance difficulties they seem to present?
- Or which, conversely, illustrate good practices that you would like to see included in the CNIL’s recommendations?