Sexual Harassment Stories Classification using NLP-1st part

Muhammad Iqbal bazmi
9 min read · Jun 12, 2021

Using the power of NLP to classify Sexual Harassment Stories.

Say NO to Sexual Harassment.

NOTE :

This Blog is divided into four parts.

  1. Introduction and EDA(Exploratory Data Analysis).
  2. Binary Classification Modeling using traditional Machine Learning(ML) and Deep Learning(DL).
  3. Multi-label Classification Modeling using traditional Machine Learning and Deep Learning.
  4. Creating Web-App using Streamlit

1. Introduction and EDA(Exploratory Data Analysis)

Table of Contents:

  • Description
  • Problem Statement
  • Real-World/Business Objectives and Constraints
  • Mapping to Machine Learning
  • EDA(Exploratory Data Analysis)
  • Interpretation of Models
  • Tools and Technologies

Description

SafeCity is a platform where people share their harassment stories. SafeCity helps communities, police, activists, and governments take appropriate action based on the stories shared online.

SafeCity is the largest platform of its kind in the world, helping communities make our cities cleaner and safer places to live and work.

SafeCity has its own technology stack that analyses anonymous crime reports shared online to identify patterns and key insights.

Stories shared online are the best source for citizens, researchers, and policymakers to create safer spaces and take appropriate action by

  • Increasing awareness, accountability, and transparency
  • Improving policy and tactical precision with data-led insights
  • Optimizing budgets to allocate resources more effectively

How does SafeCity make our cities safer?

Sharing our experiences helps them identify patterns and create safer spaces. Take 2 minutes to submit an anonymous report and help make your city safer. Click Here!

Problem Statement

There are many categories of Harassment Stories shared online on the SafeCity platform like:

  • Verbal Abuse: Commenting, Catcalls/Whistles, Online Harassment, Sexual Invites
  • Non-Verbal Abuse: Ogling/Staring, Stalking, Taking pictures without permission, Indecent exposure, Masturbation in public
  • Physical Abuse: Touching/Groping, Rape/Sexual Assault, Chain Snatching, Petty Robbery, Human Trafficking
  • Other: Poor/No Street Lighting, etc.

Among all of these, three types of harassment occur most frequently:

  1. Commenting(Verbal Abuse): #times=6,139
  2. Ogling/Staring(Non-Verbal Abuse): #times=3,605
  3. Touching/Groping(Physical Abuse): #times=4,878

So far, we have narrowed down our problem statement.

We want to help victims choose the right category for their story while they are sharing it online (at run-time). Choosing the category will help SafeCity take appropriate action.

In technical terms: we want to classify the story (text) into the three most popular categories (classes).

Question: How are we going to solve this problem?

Answer: Machine Learning(ML) and Natural Language Processing(NLP).

We will be using Traditional Machine Learning and State-of-the-art(SOTA) Deep Learning techniques to solve this problem.

TL;DR (Too Long Didn’t Read)

This is a classification problem where we have to classify stories (text) into the three most frequently shared categories (Commenting, Staring/Ogling, Groping/Touching). We will use ML and DL techniques to solve this problem.

Real-World/Business Objectives and Constraints

  • Low latency: low latency is required because we have to suggest the tag in real time. (Latency is the time required to produce the classification result.)
  • Interpretability: interpretability is important because we don’t want to use our model as a black box. (We will use feature importance and LIME.) LIME stands for Local Interpretable Model-agnostic Explanations.
  • False positives and false negatives may lead to inappropriate or inconsistent actions being taken.

Mapping to Machine Learning

Data and Type of Machine Learning

Download dataset from this link: Click Here!

There are two types of data:

  • Binary Classification data
  • Multi-label Classification data
The directory structure of the dataset

Binary Classification data:

It is also known as single-label classification. This dataset contains two columns, “Description” and “Category”. Description contains the text of the story, and Category contains the label for the corresponding text: 1 means “it belongs to the particular category”, 0 otherwise.

Distribution of binary classification data
Example of Groping dataset

For each category, there are 7201 training samples, 990 development samples, and 1701 test samples.
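As a quick sanity check, here is a minimal sketch (using pandas) of loading and inspecting one of the binary splits. The file names below are placeholders, not the exact paths in the downloaded dataset.

```python
import pandas as pd

# NOTE: the file names are placeholders; use the paths from the downloaded dataset.
train_df = pd.read_csv("groping_train.csv")   # columns: "Description", "Category"
dev_df = pd.read_csv("groping_dev.csv")
test_df = pd.read_csv("groping_test.csv")

print(train_df.shape)                          # expected: (7201, 2)
print(train_df["Category"].value_counts())     # how many 1s vs 0s in this split
```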

Multi-label Classification data: Multi-label classification is the problem where a single data point can have more than one label at the same time.

Distribution of Multi-label dataset
Example of Multi-label dataset
Distribution of TRAIN, VAL, and TEST datasets for multi-label classification
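To build intuition, here is a tiny made-up illustration of a multi-label data point. The column names are assumptions about how the three labels could be laid out, not the exact schema of the dataset.

```python
import pandas as pd

# A single (made-up) multi-label data point: one story, several active labels.
sample = pd.DataFrame({
    "Description": ["He passed comments and then tried to touch me on the bus."],
    "Commenting": [1],          # verbal abuse present
    "Ogling/Staring": [0],
    "Touching/Groping": [1],    # physical abuse present
})

# The label vector for this story is [1, 0, 1]: two labels are active at once.
print(sample[["Commenting", "Ogling/Staring", "Touching/Groping"]].values)
```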

Performance Metric

Now that we have mapped the problem to Machine Learning, we will solve it using traditional Machine Learning and Deep Learning techniques (discussed in later parts of this series). But first, the main thing is to choose the best algorithm among all of them.

Question: How would we judge our model to get the best among all?

Answer: We will judge our model using performance metrics.

There are many performance metrics to judge the model for both Binary(Single-label) and Multi-label classification.

Single-label Classification Performance metrics:

  • Accuracy score: the number of data points correctly classified divided by the total number of data points.
  • Precision: of all the data points predicted as positive, how many are actually positive, i.e. TP / (TP + FP). (How precise is the model?)
  • Recall: of all the actually positive data points, how many are predicted as positive, i.e. TP / (TP + FN).
  • F1-score: the harmonic mean of precision and recall.
  • ROC-AUC score (Receiver Operating Characteristic — Area Under the Curve): one of the standard metrics for binary classification. It depends only on the ordering of the predicted scores, not on their actual probability values, and it can be affected by an imbalanced dataset.
  • Log loss: also known as binary cross-entropy. Log loss is based on the predicted probability scores. It is harder to interpret, but it accounts for how confident the model is. (See the code sketch after this list.)
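Here is a minimal scikit-learn sketch of the single-label metrics above, on a small made-up example:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities for class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))     # uses scores, not hard labels
print("Log-loss :", log_loss(y_true, y_prob))          # penalizes confident mistakes
```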

You can refer to this research paper:

Limitations of the research paper: in the paper, accuracy is used as the metric for binary classification. But accuracy is not a good metric when the data is imbalanced, and in our case the data is imbalanced.

Metrics used in my solution: for binary classification, I used precision and recall to judge my best model, with recall as the main metric.

Multi-label Classification Metrics:

  • Accuracy (exact match), precision, recall, F1-score, and AUC can also be used in multi-label classification. But let’s understand how.
  • Hamming loss: the fraction of labels that are incorrectly predicted. Hamming loss is a good measure of overall model performance: the lower the Hamming loss, the better the model. However, it does not tell us how the model performs on each individual label.
  • Hamming score: the fraction of labels that are correctly predicted.

Hamming loss can be computed by taking the XOR between the actual and predicted label vectors and then averaging across the dataset; the Hamming score is its complement.
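A small sketch of Hamming loss and Hamming score on made-up multi-label predictions, both via scikit-learn and via the XOR formulation above:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Rows = stories, columns = [Commenting, Ogling/Staring, Touching/Groping]
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

h_loss = hamming_loss(y_true, y_pred)   # fraction of label slots predicted wrongly
h_score = 1.0 - h_loss                  # fraction of label slots predicted correctly

# Equivalently, via XOR: a slot counts toward the loss when true != predicted.
h_loss_xor = np.mean(np.logical_xor(y_true, y_pred))

print(h_loss, h_score, h_loss_xor)      # 0.222..., 0.777..., 0.222...
```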

  • Precision, recall, and F1-score in multi-label classification: in multi-label classification, we get one value per label. To reduce these to a single metric, we can average them using several schemes: macro, micro, and weighted (see the sketch after these definitions).

NOTE: click on the above link for the Hamming loss definition and the multi-label metrics.

macro: simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class. (credit: scikit-learn.org)

micro: gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.

weighted: accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.
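The following sketch shows how these three averaging schemes are selected in scikit-learn via the `average` parameter, on made-up multi-label predictions:

```python
import numpy as np
from sklearn.metrics import precision_score

# Rows = stories, columns = [Commenting, Ogling/Staring, Touching/Groping]
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])

# Per-label values first, then the three ways of averaging them.
print(precision_score(y_true, y_pred, average=None))        # one value per label
print(precision_score(y_true, y_pred, average="macro"))     # unweighted mean over labels
print(precision_score(y_true, y_pred, average="micro"))     # pooled TP/FP over all labels
print(precision_score(y_true, y_pred, average="weighted"))  # mean weighted by label support
```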

Metrics used in the research paper:

Hamming-score and Exact-match

Metrics used in my Solution:

Precision and Recall for each label.

EDA(Exploratory Data Analysis)

Analyzing the data is the most important part of the whole Machine Learning pipeline. Machine Learning algorithms follow the GIGO (Garbage In, Garbage Out) principle: if the data is garbage/random, we will also get garbage results. So getting clean and meaningful data is essential.

So, Let’s try to understand the data…

  • There are two types of datasets, for binary and multi-label classification.
  • For binary classification, there are three separate datasets, one per label. The labels are Commenting, Ogling/Staring, and Groping.
  • Each dataset is divided into three parts: Train, Validation, and Test.
  • The distribution of Train, Val, and Test is 73%, 10%, 17%.

In the multi-label dataset, there are 2³ = 8 possible classes if we consider all combinations of the 3 labels (as we discussed above). For example, enumerating all combinations gives:
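```python
# With 3 binary labels there are 2**3 = 8 possible label combinations.
from itertools import product

labels = ["Commenting", "Ogling/Staring", "Touching/Groping"]
for combo in product([0, 1], repeat=len(labels)):
    print(dict(zip(labels, combo)))
# Prints 8 combinations, from all-zeros (no label) to all-ones (all three labels).
```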

Fortunately, we get our dataset clean and divided into Train, Val, and Test datasets.

So, Let’s just visualize the WordCloud for each label…

WordCloud: the bigger a word appears, the more frequently it occurs in the dataset.
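For reference, here is a minimal sketch of how such a WordCloud can be generated with the wordcloud library; `train_df` is assumed to be a pandas DataFrame with a “Description” column, as loaded earlier.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Join all stories in the split into one long string (train_df loaded earlier).
text = " ".join(train_df["Description"].astype(str))

wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=STOPWORDS).generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Commenting WordCloud")
plt.show()
```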

Commenting WordCloud(Credit:Author)
Ogling WordCloud(Credit: Author)
Groping WordCloud(Credit: Author)

In the above 3 WordClouds, “comment” is among the most frequent words in both the “Commenting” and “Ogling/Staring” datasets.

This means our models will also get confused when predicting Ogling/Staring stories: because “comment” is also very frequent in the Ogling data, the model will often predict Commenting instead of Ogling.

Put another way: no matter how complex an algorithm we use, we may still get poor results on these overlapping stories because of the dataset itself.

Interpretation of Models

After training all the models, we pick our final model with respect to our metric (in this case, recall). We don’t want to use our model as a black box, so we will use LIME (Local Interpretable Model-agnostic Explanations). We can interpret any model (both ML and DL) using LIME. LIME is especially useful for interpreting the behavior of complex Deep Learning models, because DL algorithms are not easily interpretable by nature.

See the example below:

Story(Using our Web-App using Streamlit)
Interpretation of above Story
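For the curious, here is a hedged, self-contained sketch of how LIME explains a single text prediction. The tiny TF-IDF + Logistic Regression pipeline and the toy training sentences below are purely illustrative, not the actual models used later in this series.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, made up purely for illustration.
texts = ["he passed lewd comments at me",
         "he kept commenting on my clothes",
         "someone was staring at me on the bus",
         "a man kept ogling from across the street"]
labels = [1, 1, 0, 0]   # 1 = Commenting, 0 = Not Commenting

# Pipeline maps raw text straight to class probabilities, which LIME needs.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["Not Commenting", "Commenting"])
explanation = explainer.explain_instance(
    "He kept passing comments every time I walked past the shop.",
    pipeline.predict_proba, num_features=6)

print(explanation.as_list())   # words with their positive/negative contribution weights
```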

I hope this is enough for this blog. Stay tuned for the next parts on Machine Learning modeling and the final web app built with Streamlit and deployed on Heroku.

For the complete code: GitHub, Click Here!

For a demo of the Web-App: Click Here!

If you have any queries, please comment. Thanks for reading this blog and stay tuned.

If you liked this post, please clap…

References:

  1. AppliedAICourse.com: For teaching in-depth Machine Learning and Deep Learning.
  2. SafeCity: For the research paper and the dataset.
  3. Krish Naik: For Heroku deployment and Github repository Management.

Muhammad Iqbal bazmi

A self-taught programmer, Data Scientist, and Machine Learning Engineer.