Detecting Question Similarity Using BERT
Inspiration
While taking a Natural Language Processing course, I came across the Quora Question Pairs dataset and realized how AI could help administrative personnel manage and reduce their workload by automatically finding and flagging highly similar questions.
What it does
- Preprocessing: Cleans, tokenizes, and formats text data for NLP tasks.
- Data augmentation: Uses the transitive property of similarity to infer additional labeled question pairs.
- Training: Fine-tunes a BERT-based model to learn representations of the data and classify question pairs as similar or not.
- Grouping: Organizes questions into clusters for efficient inference.
- Classification: Labels question pairs as similar or dissimilar using a supervised technique.
How we built it
- Preprocessing: Cleaned and normalized the text with lowercasing, tokenization, stop word removal, punctuation removal, POS tagging, and lemmatization (see the first sketch after this list).
- Data augmentation: Used the transitive property to infer similarity between questions and generate new training pairs (second sketch below).
- Training: Fine-tuned a BERT-based model to learn representations of the data and classify question pairs as similar or not (third sketch below).
- Grouping: Grouped questions into clusters for efficient inference (fourth sketch below).
- Classification: Classified question pairs as similar or not using a supervised technique.
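
A minimal sketch of the preprocessing pipeline using NLTK. The function names are illustrative assumptions rather than the project's actual code, and NLTK resource names can vary slightly between versions:

```python
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Download required NLTK resources (no-ops if already present);
# exact resource names may differ in newer NLTK releases.
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to a WordNet POS tag for lemmatization."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, POS-tag, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
    tagged = nltk.pos_tag(tokens)
    return [LEMMATIZER.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]

print(preprocess("How do I learn machine learning quickly?"))
```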
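A sketch of the transitive-property augmentation, assuming known duplicate pairs are available as tuples: if (a, b) and (b, c) are labeled duplicates, (a, c) is inferred to be a duplicate too. All names here are illustrative:

```python
from itertools import combinations

def augment_with_transitivity(duplicate_pairs):
    """Infer new duplicate pairs via transitivity.

    Builds connected components over the 'is duplicate' relation with
    union-find, then emits every unordered pair inside each component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in duplicate_pairs:
        union(a, b)

    groups = {}
    for q in parent:
        groups.setdefault(find(q), set()).add(q)

    # All pairs within each component, including the newly inferred ones.
    return {frozenset(p) for g in groups.values() for p in combinations(sorted(g), 2)}

pairs = [("q1", "q2"), ("q2", "q3")]
print(augment_with_transitivity(pairs))
# Includes the inferred pair {'q1', 'q3'} alongside the two originals.
```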
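A minimal fine-tuning sketch using the Hugging Face transformers library. The checkpoint, hyperparameters, and toy data are illustrative assumptions, not the project's exact configuration:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative learning rate

# Toy data: question pairs with duplicate (1) / not duplicate (0) labels.
q1 = ["How do I learn Python?", "What is the capital of France?"]
q2 = ["What is the best way to learn Python?", "How tall is Mount Everest?"]
labels = torch.tensor([1, 0])

# BERT encodes each pair as a single sequence: [CLS] q1 [SEP] q2 [SEP].
enc = tokenizer(q1, q2, padding=True, truncation=True, return_tensors="pt")

model.train()
for epoch in range(3):  # illustrative epoch count
    optimizer.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={out.loss.item():.4f}")
```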
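And one way the cluster-based inference could look; `predict_similar` is a hypothetical stand-in for a pairwise call to the fine-tuned model. Comparing a new question against one representative per cluster avoids scoring it against every stored question:

```python
def assign_to_cluster(question, clusters, predict_similar):
    """Compare a new question only against one representative per cluster.

    `clusters` is a list of lists; the first element of each is the
    representative. Returns the index of the matching cluster, or of a
    newly created one if nothing matches.
    """
    for i, cluster in enumerate(clusters):
        if predict_similar(question, cluster[0]):
            cluster.append(question)
            return i
    clusters.append([question])
    return len(clusters) - 1
```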
Challenges we ran into
- Data Preprocessing: Initially, I had trouble tokenizing and preparing the data correctly. I overcame this by reading the documentation, asking experienced practitioners for help, and experimenting with different preprocessing methods.
- Data Augmentation: Putting the transitive property to work in the augmentation step was difficult. After researching and testing several strategies, I implemented the procedure successfully and generated 25% more training data points.
- Model Training: I had trouble tuning and training the model. By adjusting the batch size, learning rate, and other hyperparameters, I improved the model's performance and obtained solid results.
Accomplishments that we're proud of
- Successfully implemented data augmentation, resulting in a 25% larger training dataset.
- Built a BERT-based model that effectively determines whether question pairs are similar.
- Overcame difficulties with model training, data augmentation, and data preprocessing.
What we learned
- Data preparation: Realized how crucial preprocessing techniques are for getting text data ready for NLP tasks.
- Data augmentation: Learned how to apply the transitive property of similarity to augment data for NLP tasks.
- BERT-based models: Gained hands-on experience fine-tuning pre-trained BERT models for a specific task.
- Efficient inference: Learned how important grouping questions into clusters is for efficient inference.
What's next for Detecting Question Similarity Using BERT
- Exploring additional data augmentation methods to further improve model performance.
- Investigating other transformer-based models for question similarity detection.
- Applying the same approach to other datasets and use cases.
- Integrating question similarity detection into real-time chatbots or web applications.
Built With
- bert
- nltk
- numpy
- pandas
- python
- pytorch
- scikit-learn
- transformers