Detecting Question Similarity Using BERT

Inspiration

I came across the Quora Question Pairs dataset during a Natural Language Processing course, and it opened my eyes to how AI could help administrative personnel manage and reduce their workload by automatically finding and highlighting extremely similar questions.

What it accomplishes

  1. Preprocessing: Cleaning, tokenizing, and formatting text data for NLP tasks.
  2. Data augmentation: Utilizing the transitive property to infer similarity between questions.
  3. Training: Fine-tuning a pre-trained BERT model to learn representations of the data and classify question pairs as similar or not.
  4. Grouping: Grouping questions into clusters for effective inference.
  5. Classification: Classifying question pairs as similar or not with a supervised technique.

How it was constructed

  1. Preprocessing: Cleaned, tokenized, and formatted the text data via lowercasing, tokenization, stop-word removal, POS tagging, punctuation removal, and lemmatization (see the sketch after this list).
  2. Data augmentation: Utilized the transitive property to infer similarity between questions.
  3. Training: Trained a BERT-based model to learn representations of the data and classify question pairs as similar or not.
  4. Grouping: Grouped questions into clusters for effective inference.
  5. Classification: Classified question pairs as similar or not using a supervised technique.
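
As a concrete illustration of the preprocessing step, here is a minimal sketch using NLTK. The function names and the exact set of resources downloaded are illustrative assumptions, not the project's actual code.

```python
# A minimal preprocessing sketch using NLTK; names and the exact resource
# set are illustrative assumptions, not the project's actual code.
import string

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
for resource in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag onto the tags the WordNet lemmatizer expects."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(question):
    """Lowercase, tokenize, drop stop words and punctuation, POS-tag, lemmatize."""
    tokens = nltk.word_tokenize(question.lower())
    tokens = [t for t in tokens
              if t not in STOP_WORDS and t not in string.punctuation]
    tagged = nltk.pos_tag(tokens)
    return [LEMMATIZER.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]

print(preprocess("How do I learn machine learning quickly?"))
```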

Difficulties we encountered

  • Data Preprocessing: Initially, I had trouble tokenizing and preparing the data correctly. I overcame this by reading the documentation, asking experienced practitioners for help, and experimenting with different preprocessing methods.
  • Data Augmentation: Applying the transitive property in the augmentation step was difficult. After researching and testing several strategies, I implemented the procedure effectively and generated 25% more training datapoints (see the sketch after this list).
  • Model Training: Training and tuning the model was challenging. By adjusting the batch size, learning rate, and other hyperparameters, I improved the model's performance and achieved solid results.
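
To make the transitive property concrete: if question A is a duplicate of B, and B is a duplicate of C, then (A, C) can be added as a new positive pair. Below is a minimal sketch that assumes the training data arrives as (question, question, label) triples; the union-find implementation is one reasonable approach, not necessarily the one used in the project.

```python
# A sketch of transitive-property augmentation; all names are illustrative.
from itertools import combinations

def augment_with_transitivity(pairs):
    """Given (question_a, question_b, label) triples, add every unseen pair
    inside a connected component of the duplicate graph as a new positive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Link every labeled-duplicate pair into the same component.
    for a, b, label in pairs:
        if label == 1:
            union(a, b)

    # Collect the members of each component.
    components = {}
    for q in parent:
        components.setdefault(find(q), set()).add(q)

    seen = {frozenset((a, b)) for a, b, _ in pairs}
    new_pairs = []
    for members in components.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in seen:
                new_pairs.append((a, b, 1))  # inferred duplicate
    return new_pairs

data = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("What is the best way to learn Python?", "How can a beginner study Python?", 1),
]
print(augment_with_transitivity(data))
# -> [("How can a beginner study Python?", "How do I learn Python?", 1)]
```

Because a duplicate group of k questions yields k(k-1)/2 positive pairs from only k-1 labeled ones, even modest chaining grows the positive set noticeably.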

Achievements we're proud of

  • Successfully implemented data augmentation, resulting in a 25% larger dataset.
  • Created a BERT-based model that effectively determines whether question pairs are similar (see the fine-tuning sketch after this list).
  • Overcame difficulties with model training, data augmentation, and data preprocessing.
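
For reference, here is a minimal sketch of fine-tuning a BERT pair classifier with Hugging Face Transformers; the checkpoint, toy data, batch size, learning rate, and epoch count are illustrative assumptions, not the project's exact settings.

```python
# A minimal fine-tuning sketch with Hugging Face Transformers; the toy data
# and hyperparameters are assumptions, not the project's exact settings.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # two classes: not similar / similar

pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "What is the capital of France?", 0),
]

def collate(batch):
    q1, q2, labels = zip(*batch)
    # The tokenizer encodes each pair as "[CLS] q1 [SEP] q2 [SEP]".
    enc = tokenizer(list(q1), list(q2), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few epochs is typical when fine-tuning BERT
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy against the pair labels
        loss.backward()
        optimizer.step()
```

Encoding the two questions as a single "[CLS] q1 [SEP] q2 [SEP]" sequence lets BERT attend across the pair, which is the standard setup for sentence-pair classification.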

What we discovered

  • Data Preparation: Realized how crucial data preparation methods are for getting text data ready for NLP tasks.
  • Data Augmentation: Learned how to use the transitive property of similarity to augment data for NLP tasks.
  • BERT-based Models: Gained hands-on experience fine-tuning pre-trained BERT models for a specific task.
  • Efficient Inference: Learned how important grouping questions into clusters is for efficient inference (see the sketch below).
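
The clustering idea, sketched below: embed every known question once, group the embeddings, and at inference compare a new question only against members of its nearest cluster instead of the entire corpus. Mean-pooled BERT embeddings and KMeans are illustrative assumptions, since the writeup does not name the exact grouping method.

```python
# A clustering-for-inference sketch; mean-pooled BERT embeddings and KMeans
# are illustrative stand-ins for whatever grouping the project actually used.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(questions):
    """Mean-pool BERT's last hidden states into one vector per question."""
    enc = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state   # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)      # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

questions = [
    "How do I learn Python?", "What is the best way to study Python?",
    "What is the capital of France?", "Which city is France's capital?",
]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embed(questions))

# At inference, a new question is assigned to its nearest cluster, and the
# pair classifier only runs against that cluster's members.
cluster = kmeans.predict(embed(["How can a beginner learn Python?"]))[0]
candidates = [q for q, c in zip(questions, kmeans.labels_) if c == cluster]
print(candidates)
```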

What's next for BERT-based question similarity detection

  • Investigating additional data augmentation methods to further improve model performance.
  • Exploring other transformer-based models for question similarity detection.
  • Applying the same approach to other datasets and scenarios.
  • Integrating question similarity detection into real-time chatbots or web applications.
