
No Labels, No Worries! Classifying Items by Name Instantly Using Text Embeddings

Updated: Aug 5


Oftentimes, we do not have the luxury of accessing labeled data. Fortunately, as text embedding models have become more sophisticated, we are now able to classify unlabeled text data quickly—within a few days or even hours—and at a remarkably low cost. In this article, I will demonstrate through a concrete example how you can efficiently label text data with many self-defined categories using embeddings and cosine similarity.


This example includes 4 sections:

  1. Background: present the context and relevance of the project.

  2. Objectives: specify the objective and related requirements for the project.

  3. Approach: list some possible approaches and present the chosen one, Embeddings with Cosine Similarity, in detail.

  4. Result: detail the results, important observations, and total cost.


You can find the code used throughout this example here: Google Colab Notebook




 

1. Background


X is an emerging e-commerce platform specifically for Fashion items with many third-party sellers and thousands of SKUs. The company wants to build a classifier to classify each item into the correct category within a predefined 3-level category tree (categories file). However, the data they have available (items file) is unlabeled.


This matters a lot to X because a reliable classifier lets the company:

  • Save sellers the time of filling in the category for each item themselves, which can be excruciating with thousands of SKUs.

  • Improve the search engine and the clarity of the site's structure.

  • Thus improve the customer experience, helping shoppers find the items most relevant to their needs.





2. Objectives


Develop a machine learning model to accurately classify fashion items into appropriate categories within a predefined 3-level category tree.

Other requirements:

  • Accurately classify the unlabeled fashion items in the items file into the category that best describes them.

  • Cost-effective to build and maintain.

  • Because X is an emerging fashion platform, it is important that the model can handle novel items it has not seen before (i.e., items outside the items file).



3. Approach


3.1 Potential approaches


Labeling the Data:

  1. Manual Labeling or Crowdsourcing: Engage human labelers directly or through platforms to categorize items, then apply supervised learning models.

  • Pros: Provides accurate and directly applicable training data.

  • Cons: High labor cost and time-consuming for thousands of items.

  2. Label Only a Subset and Use One-shot Learning or Semi-supervised Techniques: Begin with a small, representative sample of labeled data and extrapolate to larger, unlabeled sets.

  • Pros: Reduces the need for extensive manual labeling while retaining model effectiveness.

  • Cons: May compromise the model’s accuracy if the subset is not sufficiently representative.

Utilizing Unlabeled Data:

  1. Use ChatGPT API: Implement advanced NLP features without internal model development.

  • Pros: Access to state-of-the-art language processing capabilities.

  • Cons: Potentially high ongoing costs, especially when scaling.

  2. Employ Embedding Models: Utilize pre-trained language models to convert text into meaningful, contextual embeddings. These embeddings can serve as direct inputs for predicting labels or can be fed into another machine learning model.

  • Pros: Cost-effective; leverages the generalization power of large language models.

  • Cons: May require additional steps to tailor to specific classification needs.


3.2 Chosen approach: Embeddings with Cosine Similarity


Overview. The model takes the name of a fashion item as input and returns the best-matching category as output, using embeddings generated by OpenAI's text-embedding-3-small model and cosine similarity.


The intuition. Think of embeddings as a way to turn the names of fashion items and their categories into points on a graph. The closer two points are on this graph, the more similar they are.


Figure: Men's Oxfords items (pink dots) cluster around, and sit nearer to, the Men's Oxfords category (red dot) than the Men's Jewelry category (blue dot).
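
Concretely, "closeness" here is cosine similarity: the cosine of the angle between two embedding vectors. A minimal helper, as a sketch assuming NumPy (in practice you would compute this over whole matrices at once, as shown in the next section):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors:
    1.0 = same direction (very similar), 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```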


Steps Explained:

  • S1. Make Points for Categories: Convert each category name into a point on our graph.

  • S2. Make Points for Items: Convert each item name into a point on our graph too.

  • S3. Find the Nearest Category for Each Item: By measuring the distance between an item's point and all category points (using cosine similarity), we find out which category is closest to the item. The closest category is considered the best match for that item.
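
Putting S1-S3 together, here is a minimal runnable sketch. It assumes the openai and numpy packages and an OPENAI_API_KEY in the environment; the category and item names below are illustrative stand-ins for the categories and items files:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

categories = ["Men's Shoes > Oxfords", "Men's Accessories > Jewelry"]
items = ["Leather cap-toe oxford dress shoes", "Men's silver chain bracelet"]

cat_vecs = embed(categories)   # S1: points for categories
item_vecs = embed(items)       # S2: points for items

# S3: cosine similarity = dot product of L2-normalized vectors.
cat_vecs /= np.linalg.norm(cat_vecs, axis=1, keepdims=True)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)
sims = item_vecs @ cat_vecs.T  # shape: (n_items, n_categories)

for item, row in zip(items, sims):
    best = row.argmax()
    print(f"{item!r} -> {categories[best]} (similarity={row[best]:.2f})")
```

Each item is assigned to the category with the highest similarity score; keeping the raw scores around also lets you apply a confidence threshold, as described in the Result section.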


Pros & Cons

| Aspect | Pros | Cons |
| --- | --- | --- |
| Data Requirements | Efficient with unlabeled data; leverages semantic meaning in item names. | Highly dependent on the quality and descriptiveness of item names. |
| Cost | Cost-effective to build and maintain thanks to pre-trained models and minimal computational needs. | Dependency on external models may limit control over operational costs and updates. |
| Scalability | Handles novel items well thanks to the generalization capabilities of pre-trained embeddings. | May struggle with items whose names are not well represented in the model's training data. |
| Implementation | Simple and fast to deploy; suitable for startups and rapid development cycles. | Error diagnosis and correction can be complex due to the opaque nature of embedding models. |
| Model Performance | Quick and straightforward way to classify items using a similarity-based approach. | Shallow contextual understanding may not capture the nuances needed for accurate classification. (*) |
| Flexibility | Adaptable to various types of text data and robust against small changes in input style. | Risk of overfitting to specific linguistic patterns that are not universally applicable. |
| Maintenance | Low maintenance needs as long as the embedding model remains effective for the application context. | Adjusting and updating the model relies on third-party developments (e.g., OpenAI updates). |


4. Result



Using a similarity score threshold of 0.5, we achieved a preliminary accuracy of approximately 67% on a sample of 1,249 items (10% of the total data). Accuracy was estimated by leveraging the natural language understanding (NLU) capabilities of gpt-4 to verify the predicted labels.
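
The verification step can be scripted the same way. Below is a hypothetical sketch (the exact prompt and the judge helper are assumptions, not the ones used in the notebook); sims, items, and categories come from the earlier sketch:

```python
from openai import OpenAI

client = OpenAI()
THRESHOLD = 0.5  # minimum similarity to accept a predicted category

def judge(item_name: str, predicted_category: str) -> bool:
    """Ask GPT-4 whether the predicted category fits the item name."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Is '{predicted_category}' the correct category for the "
                        f"fashion item '{item_name}'? Answer only yes or no."),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

correct = total = 0
for i, item in enumerate(items):
    j = sims[i].argmax()
    if sims[i, j] < THRESHOLD:
        continue  # below threshold: likely outside the category tree
    total += 1
    correct += judge(item, categories[j])

print(f"GPT-4-verified accuracy: {correct / total:.0%} on {total} items")
```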


Important observations:

  • Many items fall outside the predefined category tree, which hurts accuracy.

  • Since the differences between some categories are nuanced (e.g., "Women's Camisoles" vs. "Women's Tank Tops"), the similarity scores for these items are low, which is expected behavior (*).

Total costs: $1.62 (<$0.01 for Embedding model, and $1.59 for GPT-4).



Improvements. To fine-tune the performance of the classifier, you can consider:

  • Input: (1) make the input more meaningful and eliminate noise (if any), (2) change the category tree (the naming) so that the differences among the categories are more obvious.

  • Model: consider more powerful models (e.g., OpenAI's text-embedding-3-large, as shown below), or models that better fit the language you are dealing with.

  • Approach: go with an entirely different approach.
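
For instance, swapping in the larger embedding model is a one-line change to the embed helper from the earlier sketch:

```python
# Larger model: typically more accurate, but more expensive per token.
resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
```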



