<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.oliverguhr.eu/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.oliverguhr.eu/" rel="alternate" type="text/html" /><updated>2024-12-10T15:24:36+00:00</updated><id>https://www.oliverguhr.eu/feed.xml</id><title type="html">Tales of machines</title><subtitle>Tales of machines I met and loved, a personal blog about programming computers.
</subtitle><entry><title type="html">How to build a simple hate speech detector with machine learning</title><link href="https://www.oliverguhr.eu/nlp/jekyll/2019/08/02/build-a-simple-hate-speech-detector-with-machine-learning.html" rel="alternate" type="text/html" title="How to build a simple hate speech detector with machine learning" /><published>2019-08-02T13:00:00+00:00</published><updated>2019-08-02T13:00:00+00:00</updated><id>https://www.oliverguhr.eu/nlp/jekyll/2019/08/02/build-a-simple-hate-speech-detector-with-machine-learning</id><content type="html" xml:base="https://www.oliverguhr.eu/nlp/jekyll/2019/08/02/build-a-simple-hate-speech-detector-with-machine-learning.html"><![CDATA[<p>Not everybody on the internet behaves nicely, and some comments are just rude or offensive. If you run a web page that offers a public comment function, hate speech can be a real problem. For example in Germany, you are legally required to delete hate speech comments. This can be challenging if you have to check thousands of comments each day.
So wouldn’t it be nice if you could automatically check a user’s comment and give them a little hint to stay nice?
<!--description--></p>

<p>The simplest thing you could do is to check if the user’s text contains offensive words. However, this approach is limited since you can offend people without using offensive words.</p>

<p>This post will show you how to train a machine learning model that can detect if a comment or text is offensive. And to start you need just a few lines of Python code \o/</p>

<h2 id="the-data">The Data</h2>

<p>First, you need data. In this case, you need a list of offensive and non-offensive texts. I wrote this tutorial for a machine learning course in Germany, so I used German texts, but you should be able to use other languages too.</p>

<p>For a machine learning competition, scientists provided a list of comments labeled as offensive and non-offensive (<a href="https://projects.fzai.h-da.de/iggsa/projekt/">GermEval 2018, Subtask 1</a>). This is perfect for us since we can simply use this data.</p>

<h2 id="the-code">The Code</h2>

<p>To tackle this task, I would first establish a baseline and then improve the solution step by step. Luckily, the organizers also published the scores of all submissions, so we can get a sense of how well we are doing.</p>

<p>For our baseline model we are going to use <a href="https://fasttext.cc/">Facebook’s fastText</a>. It’s simple to use, works with many languages and does not require any special hardware like a GPU. Oh, and it’s fast :)</p>

<h3 id="1-load-the-data">1. Load the data</h3>

<p>After you have downloaded the training data file <a href="https://github.com/uds-lsv/GermEval-2018-Data">germeval2018.training.txt</a>, you need to transform it into a format that fastText can read.
fastText’s standard format is “__label__[your label] some text”:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__label__offensive some insults
__label__other have a nice day
</code></pre></div></div>
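
<p>A minimal sketch of this conversion, assuming each line of the GermEval file holds the comment text, a coarse label (OFFENSE or OTHER) and a fine-grained label, separated by tabs. Please check that layout against your copy of the data. The function names and the label mapping are my own choices for illustration and are not part of fastText or the contest tooling.</p>

```python
# Map the (assumed) GermEval coarse labels to the label names used in this post.
LABELS = {"OFFENSE": "offensive", "OTHER": "other"}


def to_fasttext_line(raw_line):
    """Convert one tab-separated GermEval line into fastText's text format."""
    # Assumed column layout: text, coarse label, fine-grained label.
    text, coarse_label = raw_line.rstrip("\n").split("\t")[:2]
    # fastText reads one example per line, so collapse any inner whitespace.
    text = " ".join(text.split())
    return f"__label__{LABELS[coarse_label]} {text}"


def convert_file(source_path, target_path):
    """Write a fastText-readable copy of a GermEval file."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for raw_line in source:
            if raw_line.strip():  # skip empty lines
                target.write(to_fasttext_line(raw_line) + "\n")
```

<p>Calling convert_file('germeval2018.training.txt', 'fasttext.train') would then produce the training file used in the next step.</p>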

<h3 id="2-train-the-model">2. Train the Model</h3>

<p>To train the model you need to install the fastText Python package.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>fasttext
</code></pre></div></div>
<p>The training itself takes just a few lines of code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">fasttext</span>
<span class="c1"># Hyperparameters are passed to fastText as keyword arguments.</span>
<span class="n">training_parameters</span> <span class="o">=</span> <span class="p">{</span><span class="s">'epoch'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">0.05</span><span class="p">,</span> <span class="s">'loss'</span><span class="p">:</span> <span class="s">"ns"</span><span class="p">,</span> <span class="s">'thread'</span><span class="p">:</span> <span class="mi">8</span><span class="p">,</span> <span class="s">'ws'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="s">'dim'</span><span class="p">:</span> <span class="mi">100</span><span class="p">}</span>
<span class="c1"># Current fasttext releases train with train_supervised(); the older supervised() call is no longer supported.</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">fasttext</span><span class="p">.</span><span class="n">train_supervised</span><span class="p">(</span><span class="n">input</span><span class="o">=</span><span class="s">'fasttext.train'</span><span class="p">,</span> <span class="o">**</span><span class="n">training_parameters</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">save_model</span><span class="p">(</span><span class="s">'model.bin'</span><span class="p">)</span>  <span class="c1"># keep the trained model on disk</span>
</code></pre></div></div>

<p>I packed all the training parameters into a separate dictionary. To me that looks a bit cleaner, but you don’t need to do that.</p>

<h3 id="3-test-your-model">3. Test your Model</h3>

<p>After training the model, it is time to test how it performs. FastText provides a handy test method to evaluate the model’s performance. To compare our model with the other models from the GermEval contest, I also added a lambda that calculates the <a href="https://en.wikipedia.org/wiki/F1_score">F1 score</a> from precision and recall. For now, I did not use the official test script from the contest’s repository, which you should do if you want to participate in such a contest.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
    <span class="n">f1_score</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">precision</span><span class="p">,</span> <span class="n">recall</span><span class="p">:</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">((</span><span class="n">precision</span> <span class="o">*</span> <span class="n">recall</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">precision</span> <span class="o">+</span> <span class="n">recall</span><span class="p">))</span>
    <span class="c1"># model.test() returns (number of examples, precision, recall), in that order.</span>
    <span class="n">nexamples</span><span class="p">,</span> <span class="n">precision</span><span class="p">,</span> <span class="n">recall</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">test</span><span class="p">(</span><span class="s">'fasttext.test'</span><span class="p">)</span>
    <span class="k">print</span> <span class="p">(</span><span class="sa">f</span><span class="s">'recall: </span><span class="si">{</span><span class="n">recall</span><span class="si">}</span><span class="s">'</span> <span class="p">)</span>
    <span class="k">print</span> <span class="p">(</span><span class="sa">f</span><span class="s">'precision: </span><span class="si">{</span><span class="n">precision</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span> <span class="p">(</span><span class="sa">f</span><span class="s">'f1 score: </span><span class="si">{</span><span class="n">f1_score</span><span class="p">(</span><span class="n">precision</span><span class="p">,</span><span class="n">recall</span><span class="p">)</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span> <span class="p">(</span><span class="sa">f</span><span class="s">'number of examples: </span><span class="si">{</span><span class="n">nexamples</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>

<p>I don’t know about you, but I am so curious how we score. Annnnnnnd:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>recall: 0.7018686296715742
precision: 0.7018686296715742
f1 score: 0.7018686296715742
number of examples: 3532
</code></pre></div></div>

<p>Looking at the <a href="https://github.com/uds-lsv/GermEval-2018-Data/blob/master/results.pdf">results</a>, we can see that the best model in the contest had an average F1 score of 76.77, and <strong>our model achieves, without any optimization or preprocessing, an F1 score of 70.18.</strong></p>

<p>This is pretty good since the models for these contests are usually specially optimized for the given data.</p>
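
<p>One caveat for this comparison: fastText’s test method reports precision and recall over all examples, and with exactly one label per comment both equal plain accuracy, which is why the three numbers above are identical. As far as I know, the contest ranks systems by an F1 score averaged over the two classes. A minimal sketch of such a macro-averaged F1 follows. The function name is my own, and the official evaluation script may compute its average differently, so treat this as an approximation and not as the contest metric.</p>

```python
def macro_f1(gold, predicted):
    """Average the per-class F1 scores over all labels that occur."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives per class.
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Guard against a division by zero when a class is never hit.
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

<p>To use it, collect the gold labels from the test file and the model’s predicted label for each comment, then pass both lists to macro_f1.</p>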

<p>FastText is a clever piece of software that uses some neat tricks. If you are interested in how it works, take a look at <a href="https://arxiv.org/abs/1607.04606">this paper</a> and <a href="https://arxiv.org/abs/1607.01759">this one</a>. For example, fastText uses character n-grams. This approach is well suited for the German language, which uses a lot of compound words.</p>
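
<p>To get a feeling for why character n-grams help with compound words, here is a small sketch that splits a word into its substrings. It is an illustration only and not fastText’s actual implementation: fastText additionally wraps each word in boundary symbols and hashes the n-grams, both of which are left out here. Also note that, as far as I know, supervised training leaves character n-grams switched off by default (minn and maxn are 0), so you would have to set these two parameters to use them in the classifier.</p>

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of a word with min_n to max_n characters."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters over the word.
        for start in range(len(word) - n + 1):
            ngrams.append(word[start:start + n])
    return ngrams


# Two different compounds share the n-gram "hass", so they also share
# part of their representation even if one of them was never seen in training.
shared = set(char_ngrams("hassrede")) & set(char_ngrams("hasskommentar"))
```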

<h2 id="next-steps">Next Steps</h2>

<p>In this very basic tutorial, we trained a model with just a few lines of Python code. There are several things you can do to improve this model. The first step would be to preprocess your data. During preprocessing you could lowercase all texts, remove URLs and special characters, correct spelling, etc. After every optimization step, you can test your model and check whether your scores went up. Happy hacking :)</p>
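
<p>A minimal preprocessing sketch along these lines, using only Python’s standard library. The function name and the two patterns are my own choices, not something fastText prescribes, and spelling correction is left out. Apply it to the comment text before you add the label prefix, and use the same function for training, test and live data.</p>

```python
import re

# Rough URL matcher: http:// or https:// followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")
# Anything that is neither a word character nor whitespace.
# In Python 3, \w also matches German umlauts.
SPECIAL_CHARS = re.compile(r"[^\w\s]")


def preprocess(text):
    """Lowercase the text, then strip URLs and special characters."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = SPECIAL_CHARS.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())
```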

<p>Some Ideas:</p>

<ol>
  <li>Preprocess the data</li>
  <li>Optimize the parameters (number of training epochs, learning rate, embedding dims, word n-grams)</li>
  <li>Use pre-trained word vectors from the fastText website</li>
  <li>Add more data to the training set</li>
  <li>Use data augmentation</li>
</ol>
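
<p>Idea 2 from the list above could be approached with a simple grid search. The sketch below only generates the parameter combinations and picks the best one according to a scoring function that you supply. Both function names are hypothetical helpers of mine, and the scoring function is deliberately left open.</p>

```python
from itertools import product


def parameter_grid(options):
    """Yield one dictionary per combination of the given parameter values."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def best_parameters(options, score):
    """Return the combination for which score(parameters) is highest."""
    return max(parameter_grid(options), key=score)
```

<p>With fastText, the scoring function would train a model with the given parameters (for example epoch, lr, dim and wordNgrams) and return its precision on a held-out validation file. Use a validation file that is separate from the test data, otherwise the final score will look better than it really is.</p>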

<p>Here is the full code:</p>

<script src="https://gist.github.com/oliverguhr/31a1c93a1005d7e6e04c23d389d89cb7.js"></script>

<p>Credit: Photo by <a href="https://unsplash.com/photos/IYtVtgXw72M">Jon Tyson on Unsplash</a></p>]]></content><author><name></name></author><category term="jekyll" /><summary type="html"><![CDATA[Not everybody on the internet behaves nice and some comments are just rude or offending. If you run a web page that offers a public comment function hate speech can be a real problem. For example in Germany, you are legally required to delete hate speech comments. This can be challenging if you have to check thousands of comments each day. So wouldn’t it be nice, if you can automatically check the user’s comment and give them a little hint to stay nice?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.oliverguhr.eu/assets/posts/post-no-hate.jpg" /><media:content medium="image" url="https://www.oliverguhr.eu/assets/posts/post-no-hate.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>