Detecting Question Similarity Using BERT

Inspiration

I came across the Quora Question Pairs dataset during a Natural Language Processing course, and it opened my eyes to how AI could help administrative personnel manage and reduce their workload by automatically finding and highlighting extremely similar questions.

What it accomplishes

  1. Preprocessing: Cleaning, tokenizing, and formatting text data for NLP tasks.
  2. Data augmentation: Utilizing the transitive property to infer similarity between questions.
  3. Training: Fine-tuning a pre-trained BERT model to learn representations of the data and classify question pairs as similar or not.
  4. Grouping: Grouping questions into clusters for effective inference.
  5. Classification: Classifying question pairs as similar or not with a supervised technique.

How it was constructed

  1. Preprocessing: Cleaned, tokenized, and formatted the text data via lowercasing, tokenization, stop-word removal, POS tagging, punctuation removal, and lemmatization (see the sketch after this list).
  2. Data augmentation: Utilized the transitive property to infer similarity between questions.
  3. Training: Trained a BERT-based model to learn representations of the data and classify question pairs as similar or not.
  4. Grouping: Grouped questions into clusters for effective inference.
  5. Classification: Classified question pairs as similar or not using a supervised technique.
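
As a concrete illustration of the preprocessing step, here is a minimal sketch using NLTK. The function names and the exact set of resources downloaded are illustrative assumptions, not the project's actual code.

```python
# A minimal preprocessing sketch using NLTK; names and the exact resource
# set are illustrative assumptions, not the project's actual code.
import string

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
for resource in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag onto the tags the WordNet lemmatizer expects."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(question):
    """Lowercase, tokenize, drop stop words and punctuation, POS-tag, lemmatize."""
    tokens = nltk.word_tokenize(question.lower())
    tokens = [t for t in tokens
              if t not in STOP_WORDS and t not in string.punctuation]
    tagged = nltk.pos_tag(tokens)
    return [LEMMATIZER.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]

print(preprocess("How do I learn machine learning quickly?"))
```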

Difficulties we encountered

  • Data Preprocessing: Initially, I had trouble tokenizing and preparing the data correctly. I overcame this by reading the documentation, asking experienced practitioners for help, and experimenting with different preprocessing methods.
  • Data Augmentation: Applying the transitive property in the augmentation step was difficult. After researching and testing several strategies, I implemented the procedure effectively and generated 25% more training datapoints (see the sketch after this list).
  • Model Training: Training and tuning the model was challenging. By adjusting the batch size, learning rate, and other hyperparameters, I improved the model's performance and achieved solid results.
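
To make the transitive property concrete: if question A is a duplicate of B, and B is a duplicate of C, then (A, C) can be added as a new positive pair. Below is a minimal sketch that assumes the training data arrives as (question, question, label) triples; the union-find implementation is one reasonable approach, not necessarily the one used in the project.

```python
# A sketch of transitive-property augmentation; all names are illustrative.
from itertools import combinations

def augment_with_transitivity(pairs):
    """Given (question_a, question_b, label) triples, add every unseen pair
    inside a connected component of the duplicate graph as a new positive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Link every labeled-duplicate pair into the same component.
    for a, b, label in pairs:
        if label == 1:
            union(a, b)

    # Collect the members of each component.
    components = {}
    for q in parent:
        components.setdefault(find(q), set()).add(q)

    seen = {frozenset((a, b)) for a, b, _ in pairs}
    new_pairs = []
    for members in components.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in seen:
                new_pairs.append((a, b, 1))  # inferred duplicate
    return new_pairs

data = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("What is the best way to learn Python?", "How can a beginner study Python?", 1),
]
print(augment_with_transitivity(data))
# -> [("How can a beginner study Python?", "How do I learn Python?", 1)]
```

Because a duplicate group of k questions yields k(k-1)/2 positive pairs from only k-1 labeled ones, even modest chaining grows the positive set noticeably.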

Achievements we're proud of

  • Successfully implemented data augmentation, resulting in a 25% larger dataset.
  • Created a BERT-based model that effectively determines whether question pairs are similar (see the fine-tuning sketch after this list).
  • Overcame difficulties with model training, data augmentation, and data preprocessing.
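
For reference, here is a minimal sketch of fine-tuning a BERT pair classifier with Hugging Face Transformers; the checkpoint, toy data, batch size, learning rate, and epoch count are illustrative assumptions, not the project's exact settings.

```python
# A minimal fine-tuning sketch with Hugging Face Transformers; the toy data
# and hyperparameters are assumptions, not the project's exact settings.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # two classes: not similar / similar

pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "What is the capital of France?", 0),
]

def collate(batch):
    q1, q2, labels = zip(*batch)
    # The tokenizer encodes each pair as "[CLS] q1 [SEP] q2 [SEP]".
    enc = tokenizer(list(q1), list(q2), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few epochs is typical when fine-tuning BERT
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy against the pair labels
        loss.backward()
        optimizer.step()
```

Encoding the two questions as a single "[CLS] q1 [SEP] q2 [SEP]" sequence lets BERT attend across the pair, which is the standard setup for sentence-pair classification.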

What we discovered

  • Data Preparation: Realized how crucial data preparation methods are for getting text data ready for NLP tasks.
  • Data Augmentation: Learned how to use the transitive property of similarity to augment data for NLP tasks.
  • BERT-based Models: Gained hands-on experience fine-tuning pre-trained BERT models for a specific task.
  • Efficient Inference: Learned how important grouping questions into clusters is for efficient inference (see the sketch below).
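
The clustering idea, sketched below: embed every known question once, group the embeddings, and at inference compare a new question only against members of its nearest cluster instead of the entire corpus. Mean-pooled BERT embeddings and KMeans are illustrative assumptions, since the writeup does not name the exact grouping method.

```python
# A clustering-for-inference sketch; mean-pooled BERT embeddings and KMeans
# are illustrative stand-ins for whatever grouping the project actually used.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(questions):
    """Mean-pool BERT's last hidden states into one vector per question."""
    enc = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state   # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)      # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

questions = [
    "How do I learn Python?", "What is the best way to study Python?",
    "What is the capital of France?", "Which city is France's capital?",
]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embed(questions))

# At inference, a new question is assigned to its nearest cluster, and the
# pair classifier only runs against that cluster's members.
cluster = kmeans.predict(embed(["How can a beginner learn Python?"]))[0]
candidates = [q for q, c in zip(questions, kmeans.labels_) if c == cluster]
print(candidates)
```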

What's next for BERT-based question similarity detection

  • Investigating additional data augmentation methods to further improve model performance.
  • Exploring other transformer-based models for question similarity detection.
  • Applying the same approach to other datasets and scenarios.
  • Integrating question similarity detection into real-time chatbots or web applications.
