Please note, it can take 20 to 40 seconds to load the data. Its set to “Language – English” by default
Where does the key phrase list come from?
The data is from WildChat. WildChat is a public dataset of around 1 million chat conversations on ChatGPT. These conversations were recorded when non-logged-in users use Hugging Face Spaces, where they could access ChatGPT 3.5 and the ChatGPT-4 API (other models are in the dataset but the majority of conversations are these models). The data ranges from April 9, 2023, to April 30, 2024. It is the only non-toxic data, hence the lower number.
Users have to agree to the usage, and it is pretty clear that their data and conversations will be recorded and used.

This data set is being used in a number of ways by academics, and you can read about it over at https://wildchat.allen.ai/. You should also read the paper at https://openreview.net/forum?id=Bl8u7ZRlbM.
Just a note, this data will also be used by other tool providers. There are a number of use cases, research data, helping to build out key phrases for LLM search tracking tools. In these cases, the companies will be conducting further processing on the data, extracting brand names, sentiment, and topics, for example. There is great value in that work, for now, I’m just looking at the raw data.
Is this the whole conversation the user had or just the first turn?
The phrases are just the first turn. The user might have asked ChatGPT several questions in the same conversation – we are just looking at the first turn they had.
How many key phrases are in the dataset?
There are 837,989 rows of data. Please note that some key phrases are featured twice. e.g. more than one person has said “Hello”.
Users might also be using the API and sending the same request multiple times.
What languages are included in the key phrase set?
There are 74 languages included, the top language is English. Here is a breakdown of the top 10.
- English – 469425
- Chinese – 113378
- Russian – 83418
- French – 26099
- Spanish – 19045
- German – 17217
- Arabic – 11252
- Portuguese – 9602
- Maori – 8968
- Nolang – 6752
What countries are included in the data?
There is data from 202 countries, with the top one from the United States. Here are the top 10 countries in the dataset.
- United States – 170539
- Russia – 113857
- China – 101058
- Hong Kong – 46987
- United Kingdom – 30806
- Germany – 28967
- France – 26093
- Japan – 17594
- India – 16671
- Canada – 16392