
Detecting offensive words with Mistral AI 7B

When working on a simple one-time passphrase generator, I stumbled upon the issue of offensive words coming up in the output displayed to users. To solve this problem, I needed a way to detect and filter them out automatically.

A robot censoring books (by Stable Diffusion)

Do you really need an LLM for that?

The problem of filtering out offensive words isn’t exactly new. A simple way to solve it would be to use a blacklist of offensive words, and quite a few such lists are already available around the internet. Sadly, building an exhaustive list of offensive words is harder than it seems: in most cases, it took me less than a minute to find insults that bypassed those blacklists.

There are also statistical models like profanity-check, which do not rely on blacklists and should catch a much larger number of words. But, from my tests, it does not take very long to find words that pass the check but shouldn’t. I suspect these models perform better on whole sentences than on single words.

On the other hand, LLMs were trained on an insanely huge text corpus. While I remain skeptical of the claims that LLMs will take over the world, it seems pretty evident that they are excellent tools for natural language processing and should be able to detect offensive words.

Introducing Mistral-7B-Instruct-v0.1

While everyone seems to be using GPT-4, I chose not to follow the wisdom of the crowd and checked out Mistral 7B instead. Since those guys raised a €105M seed round, it has to be at least decent. The main selling point compared to GPT-4 is that it is open-source, which guarantees that the model should remain usable even if the company behind it goes under.

I quickly gave up on trying to run it on my laptop and chose to use Cloudflare Workers AI instead. This service lets you run serverless LLMs on Cloudflare’s infrastructure, which removes most of the operational complexity for very little cost.

I decided to use the instruct version of the model. This version was fine-tuned to follow instructions, so we can ask it to generate output in the format we want. For example, we can ask the model to reply only with “yes” or “no”, which is easy enough to parse.

A dictionary (photo by Joshua Hoehne)

The following prompt asks the model whether a given word is offensive:

const messages = [
  { role: 'system', content: 'You check if words are offensive, reply using yes or no' },
  { role: 'user', content: word },
];
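
With Workers AI, these messages can then be sent to the model and the one-word reply mapped to a boolean. Here is a minimal sketch, assuming the beta-era @cloudflare/ai client and the @cf/mistral/mistral-7b-instruct-v0.1 model identifier (the exact API may have changed since):

import { Ai } from '@cloudflare/ai';

// Returns true when the model flags the word as offensive.
// Assumes an `AI` binding declared in wrangler.toml.
async function isOffensive(env: { AI: any }, word: string): Promise<boolean> {
  const ai = new Ai(env.AI);
  const messages = [
    { role: 'system', content: 'You check if words are offensive, reply using yes or no' },
    { role: 'user', content: word },
  ];
  const result = (await ai.run('@cf/mistral/mistral-7b-instruct-v0.1', { messages })) as { response: string };
  // The model was instructed to answer with a bare "yes" or "no".
  return result.response.trim().toLowerCase().startsWith('yes');
}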

From this, I built a service that takes a list of words as input and returns only the inoffensive ones. It is merely 65 lines of TypeScript, as most of the logic is handled in the LLM’s black box. It can be queried with a POST request, like this:

~ ❯ curl -X POST \
     -H "Content-Type: application/json" \
     -d '["elephant", "murder", "tulip"]' \
     http://localhost:8787

["elephant","tulip"]

Turning that into a dictionary

Cloudflare’s Workers AI is currently in beta, so the API changes all the time, the rate limiting is not very straightforward, and it sometimes throws 500s at you. Calling it directly from a production environment seems out of the question. Thankfully, the English dictionary doesn’t change very often, so we can check every word once and build a new “clean” dictionary from the results.
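
The batch job itself can be a small script that walks the source word list, queries the worker with a delay between requests, and retries on the occasional 500. A rough sketch, where the file names, delays, and retry count are assumptions rather than the exact values I used:

import { readFileSync, writeFileSync } from 'node:fs';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Checks one word against the filtering worker, retrying on transient errors.
async function isClean(word: string): Promise<boolean> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const res = await fetch('http://localhost:8787', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify([word]),
    });
    if (res.ok) {
      const kept = (await res.json()) as string[];
      return kept.includes(word);
    }
    await sleep(1000); // back off before retrying
  }
  return false; // give up: better to drop a word than to keep a bad one
}

const words = readFileSync('words.txt', 'utf8').split('\n').filter(Boolean);
const clean: string[] = [];
for (const word of words) {
  if (await isClean(word)) clean.push(word);
  await sleep(50); // stay under the rate limit
}
writeFileSync('clean-words.txt', clean.join('\n'));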

The dictionary took a couple of hours to generate, as I tried to stay below the imposed rate limit. It is not perfect, but it contains 355,424 words (out of 370,105 in the source dictionary). You can see the result, as well as the full source code for this article, on GitHub.