What is NLP Text Classification?
It’s a fact: human language is incredibly complex and diverse. In addition to speaking over 7,000 languages, humans don’t express themselves linearly. So, have you ever wondered how machines can understand human language?
Natural Language Processing (NLP) is a branch of Artificial Intelligence that helps machines work with unstructured human language. NLP tools process unstructured data and give it a structure that a machine can act on. In other words, NLP helps machines understand, interpret, and manipulate human language. NLP has several practical uses, but today we’ll talk about one of them: Text Classification.
Text Classification: the act of assigning a predefined category to a text document. One example is Gmail’s spam filter, which separates spam from important emails. So, basically, what a text classifier does is assign one category (from a list of predetermined categories) to a piece of free text. Sounds easy, huh?
Easier said than done
The process of creating a text classifier isn’t simple at all. We can split it up into 4 parts:
1. Dataset Preparation:
The dataset is the collection of information that will train the machine. Let’s consider a project we’ve been working on. We recently launched a product called Ozmosys that categorizes news for teams. In this case, our dataset was a collection of news articles. To train the machine, we needed to label the data. We chose 7 categories: Banking & Financial Services; Insurance; Legal Services; Life Sciences; Media; Real Estate; and Travel Industry. The whole dataset must be labeled by category. Around 70%-80% of it is used to train the machine, so that it learns how to classify. The remaining 20%-30% is held back to test whether the machine gets the labels right: if its predictions match the labels on this held-back portion, the machine is working.
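To make the split concrete, here is a minimal Python sketch. The headlines are made-up placeholders (not real Ozmosys data), and the function name split_dataset and the 80% default are assumptions chosen for illustration.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=42):
    """Shuffle labeled examples and split them into training and test sets."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled examples: (text, category) pairs.
labeled_news = [
    ("Central bank raises interest rates", "Banking & Financial Services"),
    ("Insurer updates its flood coverage policy", "Insurance"),
    ("Law firm opens a new litigation practice", "Legal Services"),
    ("Biotech startup begins vaccine trial", "Life Sciences"),
    ("Streaming service acquires film studio", "Media"),
    ("Office rents climb in city centers", "Real Estate"),
    ("Airline adds routes for summer season", "Travel Industry"),
    ("Bank launches mobile lending app", "Banking & Financial Services"),
    ("Hotel chain reports record bookings", "Travel Industry"),
    ("Court rules on patent dispute", "Legal Services"),
]

train_set, test_set = split_dataset(labeled_news, train_fraction=0.8)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

Shuffling before cutting matters: if the data is sorted by category, a plain cut could leave whole categories out of the training set.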
2. Feature Engineering:
For each text in the dataset, it’s important to define features. These are a set of predefined characteristics the machine will need to take into account when classifying text. Features might include text length or the number of times the text includes a certain word.
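As a rough illustration of the two features mentioned above (text length and word counts), here is a small Python sketch. The function name extract_features, the keyword list, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

def extract_features(text, keywords):
    """Turn a raw text into a dictionary of simple numeric features."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # lowercase word tokens
    counts = Counter(tokens)
    features = {
        "length_chars": len(text),     # text length in characters
        "length_tokens": len(tokens),  # text length in words
    }
    for word in keywords:
        # Number of times the text includes this keyword (0 if absent).
        features["count_" + word] = counts[word]
    return features

sample = "The bank approved the loan. Another bank declined."
features = extract_features(sample, ["bank", "loan", "policy"])
print(features)
```

Real projects usually rely on richer representations (for example, counts over the whole vocabulary), but the idea is the same: each text becomes a set of numbers the model can learn from.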
3. Model Training:
Next, the machine learning model is trained on the labeled dataset.
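The article doesn’t depend on any particular model, so the sketch below uses multinomial Naive Bayes purely as an assumption: it is a common baseline for text and short enough to write by hand. The class name, the training headlines, and the tokenizer are all illustrative.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        self.total_docs = sum(self.doc_counts.values())
        return self

    def log_scores(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocabulary]  # skip unseen words
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)  # log prior of the category
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Hypothetical labeled training examples.
training_data = [
    ("Bank raises mortgage rates for new loans", "Banking & Financial Services"),
    ("Lender reports growth in consumer loans", "Banking & Financial Services"),
    ("Insurer raises premiums on home policies", "Insurance"),
    ("New claims surge after storm, insurer says", "Insurance"),
    ("Studio releases streaming series", "Media"),
    ("Newspaper launches streaming podcast", "Media"),
]

model = NaiveBayesClassifier().fit(training_data)
print(model.predict("Bank offers cheaper loans"))
print(model.predict("Insurer pays storm claims"))
```

With only six training sentences this is a toy, but it shows what training means here: counting how often each word appears in each category, then using those counts to score new text.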
4. Improve Performance of Text Classifier:
It’s important to keep improving the text classifier to increase accuracy.
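Improving a classifier starts with measuring it. The sketch below is one simple way to do that, assuming you already have the model’s predictions for the held-back test data: it reports overall accuracy plus accuracy per category, which shows where the classifier struggles. The function name evaluate and the sample labels are hypothetical.

```python
from collections import Counter

def evaluate(predictions, true_labels):
    """Return overall accuracy and the share of correct answers per true category."""
    if not true_labels or len(predictions) != len(true_labels):
        raise ValueError("need two non-empty lists of equal length")
    correct = Counter()
    total = Counter()
    for predicted, actual in zip(predictions, true_labels):
        total[actual] += 1
        if predicted == actual:
            correct[actual] += 1
    overall = sum(correct.values()) / len(true_labels)
    per_category = {label: correct[label] / total[label] for label in total}
    return overall, per_category

# Hypothetical test-set labels and model predictions.
true_labels = ["Insurance", "Insurance", "Media", "Media", "Legal Services"]
predictions = ["Insurance", "Banking & Financial Services", "Media", "Media", "Legal Services"]

overall, per_category = evaluate(predictions, true_labels)
print("overall accuracy:", overall)
for label, score in per_category.items():
    print(label, score)
```

A category with a low score is a hint about where to add labeled examples or rethink the features.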
Some important tips (and challenges) for the dataset:
1. The labeled information with which the machine is trained has to be really similar to the unlabeled data the machine will need to classify on its own. So, if you train a machine with sports news and then apply it to political news, it won’t work.
2. The dataset must include text from all the predefined categories. It’s important to include a large amount of labeled data for each category so that the text classifier can correctly learn patterns and insights.
3. Each text in the dataset must fully match one and only one category. This means that if you’re training your machine to differentiate between medical and political news, you shouldn’t train it with news about a politician being sick.
Unfortunately, some categories overlap, which could confuse the machine. We faced this hurdle in designing Ozmosys, since text from categories such as Banking & Financial Services and Insurance is likely to include similar keywords. We confronted this challenge by implementing subcategories. For instance, Banking & Financial Services was divided into subcategories such as Mortgages and Loans. So after our machine completed the first categorization, it applied a specialized model for the category it had chosen.
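The two-stage idea can be sketched in a few lines of Python. This is not the Ozmosys implementation: the keyword-counting classifiers are toy stand-ins for trained models, the keyword lists are invented, and only one category gets a specialized model to keep the example short.

```python
def make_keyword_classifier(keywords_by_label):
    """Build a toy classifier that picks the label with the most keyword hits."""
    def classify(text):
        tokens = text.lower().split()
        hits = {
            label: sum(tokens.count(keyword) for keyword in keywords)
            for label, keywords in keywords_by_label.items()
        }
        return max(hits, key=hits.get)
    return classify

# Stage 1: a general model that chooses the top-level category.
top_level = make_keyword_classifier({
    "Banking & Financial Services": ["bank", "loan", "mortgage"],
    "Insurance": ["insurer", "premium", "claim"],
})

# Stage 2: one specialized model per category (hypothetical keyword lists).
sub_models = {
    "Banking & Financial Services": make_keyword_classifier({
        "Mortgages": ["mortgage", "home"],
        "Loans": ["loan", "credit"],
    }),
}

def classify_two_stage(text):
    category = top_level(text)
    sub_model = sub_models.get(category)  # None if no specialized model exists
    subcategory = sub_model(text) if sub_model else None
    return category, subcategory

print(classify_two_stage("bank cuts mortgage rates for home buyers"))
print(classify_two_stage("insurer raises premium after claim"))
```

The design choice is that each specialized model only has to tell apart the subcategories of a single category, which is an easier problem than separating everything at once.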
It may also happen that, when the machine is running, a text doesn’t fit any of the established categories. In our example, we decided that the machine should assign a category only when the predicted probability is above 90%.
All in all, even though NLP text classification isn’t as simple as it looks, studying it is definitely worth it. It’s a trending technology with multiple uses and rapid growth, which is already having a big impact in several industries by accelerating and optimizing processes, improving UX, and automating jobs. Please comment on which other technology you would like to learn about.
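Here is a minimal sketch of that rule in Python. It assumes the model produces one raw score per category and converts those scores to probabilities with a softmax; both the conversion and the sample scores are assumptions for illustration, since models expose their confidence in different ways.

```python
import math

def softmax(scores):
    """Convert raw per-category scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def classify_with_threshold(scores, threshold=0.9):
    """Return (category, probability), or (None, probability) if not confident enough."""
    probabilities = softmax(scores)
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return best, probabilities[best]
    return None, probabilities[best]  # leave the text uncategorized

# Hypothetical raw scores for two different texts.
confident = {"Insurance": 5.0, "Media": 1.0, "Real Estate": 0.5}
uncertain = {"Insurance": 2.0, "Media": 1.8, "Real Estate": 1.5}

print(classify_with_threshold(confident))
print(classify_with_threshold(uncertain))
```

The first text clears the 90% bar and gets a category. The second is spread across several categories, so it is left uncategorized rather than being forced into a poor fit.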