ChatGPT Key Phrase tool - WildChat

Please note, it can take 20 to 40 seconds to load the data. Its set to “Language – English” by default

Where does the key phrase list come from?

The data is from WildChat. WildChat is a public dataset of around 1 million chat conversations on ChatGPT. These conversations were recorded when non-logged-in users use Hugging Face Spaces, where they could access ChatGPT 3.5 and the ChatGPT-4 API (other models are in the dataset but the majority of conversations are these models). The data ranges from April 9, 2023, to April 30, 2024. It is the only non-toxic data, hence the lower number.

Users have to agree to the usage, and it is pretty clear that their data and conversations will be recorded and used.

This data set is being used in a number of ways by academics, and you can read about it over at https://wildchat.allen.ai/. You should also read the paper at https://openreview.net/forum?id=Bl8u7ZRlbM.

Just a note, this data will also be used by other tool providers. There are a number of use cases, research data, helping to build out key phrases for LLM search tracking tools. In these cases, the companies will be conducting further processing on the data, extracting brand names, sentiment, and topics, for example. There is great value in that work, for now, I’m just looking at the raw data.

Is this the whole conversation the user had or just the first turn?

The phrases are just the first turn. The user might have asked ChatGPT several questions in the same conversation – we are just looking at the first turn they had.

How many key phrases are in the dataset?

There are 837,989 rows of data. Please note that some key phrases are featured twice. e.g. more than one person has said “Hello”.

Users might also be using the API and sending the same request multiple times.

What languages are included in the key phrase set?

There are 74 languages included, the top language is English. Here is a breakdown of the top 10.

English – 469425
Chinese – 113378
Russian – 83418
French – 26099
Spanish – 19045
German – 17217
Arabic – 11252
Portuguese – 9602
Maori – 8968
Nolang – 6752

What countries are included in the data?

There is data from 202 countries, with the top one from the United States. Here are the top 10 countries in the dataset.

United States – 170539
Russia – 113857
China – 101058
Hong Kong – 46987
United Kingdom – 30806
Germany – 28967
France – 26093
Japan – 17594
India – 16671
Canada – 16392

ChatGPT Key Phrase tool – WildChat

Where does the key phrase list come from?

Is this the whole conversation the user had or just the first turn?

How many key phrases are in the dataset?

What languages are included in the key phrase set?

What countries are included in the data?

Stay up to date with ROAST: