Why regulators in Canada and Italy are digging into ChatGPT's use of personal information

OpenAI's chatbot is trained using data scraped from the open web

Jason Vermes · CBC News · Posted: Apr 07, 2023 4:00 AM EDT | Last Updated: April 7, 2023

OpenAI has come under fire for its use of personal information in data used to train the artificial intelligence behind its chatbot software ChatGPT. (Dado Ruvic/Reuters)

As governments rush to address concerns about the rapidly advancing generative artificial intelligence industry, experts in the field say greater oversight is needed over what data is used to train the systems.

Earlier this month, Italy's data protection agency launched a probe of OpenAI and temporarily banned ChatGPT, the company's AI-powered chatbot. On Tuesday, Canada's privacy commissioner also announced an investigation of OpenAI. Both agencies cited concerns around data privacy.

"You might say, 'Oh, maybe it feels a bit heavy handed,'" said Katrina Ingram, founder of Edmonton-based consulting company Ethically Aligned AI.

"On the other hand, a company decided that it was just going to drop this technology onto the world and let everybody deal with the consequences. So that doesn't feel very responsible as well."

Concerns about ChatGPT, transparency

Since it was released late last year, ChatGPT's ability to write everything from tweets to computer code has raised questions about its potential use in education and business. Similar AI products have been launched by Microsoft and Google in recent weeks.

These generative systems are trained to provide responses or generate output using data that is openly available on the internet — and it's not always clear what kind of information is included, experts say.

Smiling woman wearing a black cardigan and green t-shirt stands in front of a red brick wall. — Katrina Ingram is the founder of AI ethics consulting firm Ethically Aligned AI. She believes that greater oversight is needed as AI products rapidly advance. (Jani Autio)

"One of the challenges right now is that I think we may not know enough about what's going on under the hood. An investigation can help to clarify that," said Teresa Scassa, Canada Research Chair in Information Law and Policy and a law professor at University of Ottawa.

The lack of transparency has prompted organizations and governments to call for a slowdown, and even a pause, on launches of new generative AI projects.

OpenAI complied with Italy's request, and CEO Sam Altman tweeted, "we think we are following all privacy laws." European Union countries including France and Ireland have said they will examine Italy's findings on the issue, while Germany said it could block the service. Sweden has ruled out a ban on ChatGPT.

OpenAI published a blog post on Wednesday outlining its approach to safety and accuracy. The post also stated that "some" training data includes personal information. The data is not used to track users or advertise to them, but to make products more "helpful," according to the post.

The company said in the post that steps it has taken "minimize the possibility that our models might generate responses that include the personal information of private individuals."

Late last month, OpenAI said it fixed a "significant issue" that exposed some users' conversation history to a small subset of other users.

WATCH | Experts break down how AI could disrupt the workforce:

Is ChatGPT coming for your job?

With AI becoming more powerful, disruptive technology expert Joel Blit and PR executive Dara Kaplan break down how programs like ChatGPT will likely impact white-collar jobs and disrupt the workforce as we know it.

What data is scooped up?

Experts say there has been a lack of transparency around what data companies are using to train the large language models that underpin systems like OpenAI's ChatGPT.

According to Ingram, the systems are being trained with data that users have not specifically provided to the company. OpenAI says it uses a "broad corpus" of data, including licensed content, "content generated by human reviewers" and content publicly available on the internet.

"We didn't consent to any of this," Ingram said. "But as a byproduct of living in a digital age, we are entangled in this."

Information provided directly to OpenAI through ChatGPT may also be used to train AI, but that is disclosed in the product's terms of service, she said.

CBC News asked OpenAI what is included in the data used to train its products. In response, the company provided a link to the blog post published Wednesday.

'New version of an old controversy'

Black and white photo of a man wearing gray sweater. — Philip Dawson is head of policy for Armilla AI and a consultant on AI governance. (Philip Dawson/LinkedIn)

Philip Dawson, head of policy for Armilla AI — a tech company providing risk-mitigation products to companies using AI — says emerging concerns about data privacy in AI are a continuation of long-standing worries over online tracking by social networks and web companies.

"It's a new version of an old controversy. And it really calls into question some of the building blocks of large language models, which is really all about the vast amounts of data that these models are trained on and the computing power that enables that training," he said.

Dawson noted that companies are beginning to provide more information on the data sets used to train AI systems — especially as companies employing AI seek to avoid potential risk — but there's no requirement for them to do so.

Chatbot may provide inaccurate info

Whether sensitive personal data could appear in the output of a generative AI system is unclear. However, concerns have been raised about ChatGPT providing inaccurate information in response to queries.

In one example, an Australian mayor said on Wednesday that he may sue OpenAI if it does not correct false information shared about him by ChatGPT.

Brian Hood, the mayor of Hepburn Shire, became worried about his reputation after members of the public informed him that the chatbot named him as a guilty party in a foreign bribery scandal involving the Reserve Bank of Australia.

Lawyers representing Hood said that while he did work for the Reserve Bank subsidiary at the centre of the scandal, he was the person who notified authorities about the payment of bribes to foreign officials to win currency printing contracts.

OpenAI cautions that ChatGPT "may produce inaccurate information about people."

A shrouded face looks at a computer screen showing on screen messages. — ChatGPT is a chatbot that can answer written prompts. The artificial intelligence that underpins the product is trained on publicly available data scraped from across the internet. (Nicolas Maeterlinck/Getty Images)

Is a ban on AI needed?

There's already precedent for cases of internet data harvesting violating privacy law, said Scassa. In 2021, Canadian privacy regulators found that American technology firm Clearview AI had violated Canadian privacy laws by collecting photos of Canadians without their knowledge or consent.

Part of the challenge for tech companies, regulators and consumers is that laws vary from one jurisdiction to the next. While an American company scraping online data to train large language models may be legitimate in the U.S., the same rules may not apply in Europe.

"We can have whatever law we want in Canada, but we're ultimately dealing with a technology that's coming from another country and that may be operating by different norms," said Scassa.

Smiling woman with red dangling earrings and a black and white blouse. — Teresa Scassa is Canada Research Chair in Information Law and Policy and a law professor at the University of Ottawa. (Submitted by Teresa Scassa)

Canada considers stronger rules on personal data use

A proposed Canadian law, Bill C-27, which is currently at second reading in the House of Commons, aims to strengthen rules about how personal data is used by tech companies. The Artificial Intelligence and Data Act, tabled as part of C-27, would also require technology companies to provide documentation on how their AI systems are developed and to report on compliance with prescribed safeguards.

The EU is also developing a regulatory framework for artificial intelligence that outlines high- and unacceptable-risk use scenarios with the aim of protecting users.

But many experts say that a ban on generative AI, or a moratorium like the one suggested last week in an open letter signed by a group of artificial intelligence experts, industry executives and Tesla CEO Elon Musk, is not necessarily the solution.

"I think a ban is a short-term solution at best," said Ingram, noting that a slowdown in new product releases may be warranted.

"We need to speed up the regulatory process and move a bit faster on that front. And we need to have more conversations with stakeholders, including just regular people who are encountering AI in various ways in their daily life."

Addressing AI threats a challenge

On Thursday, in response to the ban by Italy's privacy regulator, OpenAI said it had no intention of putting a brake on developing AI, but it reiterated the importance of respecting rules aimed at protecting the personal data of citizens in both that country and the EU.

Until stronger regulations are in place, Scassa, the law professor, worries that addressing AI's potential threats will be a challenge.

"There is a need for government to put in place something that will structure our response so that we set legal parameters that will help us govern AI," she said.

"I certainly think that this is a very pressing issue that until we have those frameworks in place, it will be very difficult to respond to and to shape AI."

ABOUT THE AUTHOR

Jason Vermes

Journalist

Jason Vermes is a writer and editor for CBC Radio Digital, originally from Nova Scotia and currently based in Toronto. He frequently covers topics related to the LGBTQ community and previously reported on disability and accessibility. He has also worked as an online writer and producer for CBC Radio Day 6 and Cross Country Checkup.

With files from CBC News and Reuters
