mayankmittal29/DupliFinder-Quora-Clone-Catcher
Detecting duplicate Quora question pairs using ML, deep learning (LSTM, BERT), advanced NLP, feature engineering, and ensemble methods, optimized for accuracy and scalability.
DupliFinder: Quora Question Pairs Challenge
Problem Statement
Quora is a platform where people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge on Quora is to quickly identify duplicate questions in order to provide a better user experience and maintain high-quality content.
This project tackles the Quora Question Pairs challenge from Kaggle, which requires building a machine learning model that identifies whether the two questions in a pair are semantically identical (duplicates) or not.
Project Goals
- Develop models to accurately classify question pairs as duplicates or non-duplicates
- Experiment with various text preprocessing techniques
- Compare performance of traditional ML algorithms and deep learning approaches
- Extract and engineer useful features from text data
- Optimize model performance through hyperparameter tuning and cross-validation
Dataset Description
The dataset consists of over 400,000 question pairs from Quora, each with the following fields:
- id: The unique identifier for a question pair
- qid1, qid2: Unique identifiers for each question (only in train.csv)
- question1, question2: The full text of each question
- is_duplicate: The target variable (1 if questions are duplicates, 0 otherwise)
The dataset (train.csv and test.csv) can be found at this OneDrive link: https://iiithydstudents-my.sharepoint.com/:u:/g/personal/mayank_mittal_students_iiit_ac_in/Ef2igGfs64VDqRpSfgYc7-8Biad7vuYDD7qrnD2NDngVmQ?e=SUeJaH
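The fields listed above can be read with Python's standard `csv` module. The following is a minimal sketch, not the project's actual loader: the `load_pairs` helper and the two sample rows are hypothetical, and the column layout is assumed to match the field list in this section.

```python
import csv
import io

# Hypothetical two-row sample in the train.csv layout described above.
SAMPLE_CSV = '''"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","How do I learn Python?","What is the best way to learn Python?","1"
"1","3","4","What is machine learning?","How do I bake bread?","0"
'''

def load_pairs(handle):
    """Read question pairs from an open CSV file handle into a list of dicts."""
    pairs = []
    for row in csv.DictReader(handle):
        label = row.get("is_duplicate")
        pairs.append({
            "id": int(row["id"]),
            "question1": row["question1"],
            "question2": row["question2"],
            # The label column may be missing in an unlabeled file; use None then.
            "is_duplicate": int(label) if label else None,
        })
    return pairs

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
duplicate_rate = sum(p["is_duplicate"] for p in pairs) / len(pairs)
print(len(pairs), duplicate_rate)
```

For the real file, the same function could be called as `load_pairs(open("data/train.csv", newline="", encoding="utf-8"))`, assuming the file sits in the `data/` directory.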
Methodology
1. Data Exploration and Preprocessing
- Exploratory Data Analysis (EDA)
  - Distribution of duplicate/non-duplicate questions
  - Question length analysis
  - Word frequency analysis
  - Visualization of key features
- Text Preprocessing
  - Removal of HTML tags and special characters
  - Expanding contractions
  - Tokenization
  - Stopword removal
  - Stemming/Lemmatization
  - Advanced cleaning techniques
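The preprocessing steps above could be chained roughly as follows. This is a standard-library sketch under stated assumptions: the `CONTRACTIONS` and `STOPWORDS` tables are tiny illustrative stand-ins (a real pipeline would likely use fuller lists such as NLTK's), the `preprocess` name is hypothetical, and stemming/lemmatization is left out.

```python
import html
import re

# Tiny illustrative tables (assumptions); real lists would be much longer.
CONTRACTIONS = {"what's": "what is", "can't": "cannot", "don't": "do not", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "are", "do", "to", "of", "in", "what", "how", "i"}

def preprocess(text):
    """Lowercase, strip HTML, expand contractions, tokenize, drop stopwords."""
    text = html.unescape(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = text.replace("\u2019", "'")          # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        # Word boundaries keep the replacement from firing inside longer words.
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove special characters
    tokens = text.split()                       # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<b>What's</b> the best way to learn Python?"))
```

With these toy tables, the example call returns `['best', 'way', 'learn', 'python']`.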
2. Feature Engineering
- Basic Features
  - Question length
  - Word count
  - Common words between questions
  - Word share ratio
- Advanced Features
  - Token features (common words, stopwords, etc.)
  - Length-based features
  - Fuzzy matching features (Levenshtein distance, etc.)
  - TF-IDF features
  - Word embedding features
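A few of the features above can be sketched in plain Python. The function names and the exact definition of the word share ratio (common unique words divided by the summed unique word counts of both questions) are assumptions made for illustration; the Levenshtein distance is the classic dynamic-programming version, written out here rather than taken from FuzzyWuzzy.

```python
def basic_features(q1, q2):
    """Length, word-count, and word-overlap features for one question pair."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    common = len(w1 & w2)
    total = len(w1) + len(w2)
    return {
        "len_q1": len(q1),
        "len_q2": len(q2),
        "words_q1": len(q1.split()),
        "words_q2": len(q2.split()),
        "common_words": common,
        # Common unique words divided by the summed unique word counts.
        "word_share": common / total if total else 0.0,
    }

def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def levenshtein_ratio(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0

print(basic_features("how to learn python", "how to learn java"))
print(levenshtein("kitten", "sitting"), levenshtein_ratio("python", "pyhton"))
```

Each pair then becomes one row of numeric features that a traditional classifier can consume.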
3. Text Representation Methods
- Bag of Words (BoW)
- TF-IDF Vectorization
- Word Embeddings
  - Word2Vec
  - GloVe
  - FastText
- Contextual Embeddings
  - BERT
  - RoBERTa
  - DistilBERT
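To make the sparse representations concrete, here is a small standard-library sketch of TF-IDF weighting followed by cosine similarity between two questions. It assumes a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1 and already-tokenized input; the helper names are hypothetical, and a real pipeline would more likely use a library vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse {term: weight} TF-IDF vector per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Raw term count times a smoothed IDF, so no weight is ever zero.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "how do i learn python".split(),
    "what is the best way to learn python".split(),
    "how do i bake bread".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Replacing the IDF factor with 1 turns the same code into a plain Bag of Words comparison. Dense embeddings (Word2Vec, GloVe, BERT) can reuse the cosine step with vectors of floats instead of dicts.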
4. Machine Learning Models
- Traditional ML Algorithms
  - Random Forest
  - XGBoost
  - Support Vector Machines (SVM)
  - Logistic Regression
  - Naive Bayes
- Deep Learning Models
  - LSTM/BiLSTM
  - Siamese Networks
  - Transformer-based models
  - Fine-tuned BERT/RoBERTa
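As an illustration of the simplest model in the list, the sketch below fits a logistic regression with batch gradient descent using only the standard library. It is a from-scratch toy, not the project's implementation (which presumably relies on Scikit-learn or similar); the feature rows, learning rate, and epoch count are made-up values chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that a pair is a duplicate under the current parameters."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)

def train_logistic_regression(X, y, lr=1.0, epochs=2000):
    """Fit weights and bias with batch gradient descent on mean log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for k, f in enumerate(features):
                grad_w[k] += error * f
            grad_b += error
        # Average the gradients over the batch before stepping.
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Hypothetical feature rows: [word_share, levenshtein_ratio] per question pair.
X = [[0.45, 0.90], [0.40, 0.80], [0.05, 0.20], [0.10, 0.30]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
print([round(predict_proba(weights, bias, row), 3) for row in X])
```

The same engineered features could be fed unchanged to the tree-based models, while the deep learning models would instead consume token sequences or embeddings.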
5. Model Optimization
- Hyperparameter Tuning
  - Grid Search
  - Random Search
  - Bayesian Optimization
- Cross-Validation
  - K-Fold Cross-Validation
  - Stratified K-Fold Cross-Validation
- Ensemble Methods
  - Voting
  - Stacking
  - Bagging
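Stratified K-Fold matters when the classes are imbalanced, because each fold should keep roughly the overall duplicate/non-duplicate ratio. The following standard-library sketch shows the idea; the `stratified_k_fold` helper is hypothetical and much simpler than a library splitter.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve class ratios per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, valid

labels = [1] * 4 + [0] * 6
for train_idx, valid_idx in stratified_k_fold(labels, k=2):
    print(valid_idx, [labels[i] for i in valid_idx])
```

Each hyperparameter candidate from a grid or random search would then be scored by averaging a metric over these folds.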
Performance Metrics
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Log Loss
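The metrics above can be computed from true labels and predicted probabilities as in this standard-library sketch. The function names, the 0.5 decision threshold, and the sample values are assumptions for illustration; ROC-AUC is computed by direct pairwise comparison, which is fine for small samples but too slow for the full dataset.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1, and log loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15  # clip probabilities so log() never receives 0
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]
print(classification_metrics(y_true, y_prob), roc_auc(y_true, y_prob))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas log loss and ROC-AUC are computed from the probabilities directly.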
Results
| Model | Embedding | Accuracy | F1 Score | ROC-AUC |
|---|---|---|---|---|
| Random Forest | BoW | 80.2% | 0.79 | 0.86 |
| XGBoost | BoW | 81.3% | 0.80 | 0.87 |
| SVM | TF-IDF | 82.5% | 0.81 | 0.88 |
| LSTM | Word2Vec | 83.7% | 0.82 | 0.89 |
| BERT | Contextual | 87.2% | 0.86 | 0.92 |
Note: This table will be updated as more models are implemented and tested.
Future Work
- Implement more advanced deep learning architectures
- Experiment with different embedding techniques
- Explore transfer learning approaches
- Investigate attention mechanisms
- Develop an ensemble of best-performing models
- Build a simple web app for question duplicate detection
Tools and Technologies
- Programming Language: Python
- ML Libraries: Scikit-learn, XGBoost, LightGBM
- DL Libraries: TensorFlow, Keras, PyTorch
- NLP Libraries: NLTK, SpaCy, Transformers
- Data Manipulation: NumPy, Pandas
- Visualization: Matplotlib, Seaborn, Plotly
- Text Processing: Regex, BeautifulSoup, FuzzyWuzzy
Repository Structure
DupliFinder/
│
├── data/                          # Dataset files
│   ├── train.csv                  # Training set
│   └── test.csv                   # Test set
│
├── notebooks/                     # Jupyter notebooks
│   ├── 1_EDA.ipynb                # Exploratory Data Analysis
│   ├── 2_Preprocessing.ipynb      # Text preprocessing
│   ├── 3_FeatureEngineering.ipynb # Feature engineering
│   ├── 4_Traditional_ML.ipynb     # Traditional ML models
│   └── 5_Deep_Learning.ipynb      # Deep learning models
│
├── src/                           # Source code
│   ├── preprocessing/             # Text preprocessing modules
│   ├── features/                  # Feature engineering modules
│   ├── models/                    # Model implementations
│   ├── utils/                     # Utility functions
│   └── visualization/             # Visualization functions
│
├── models/                        # Saved model files
│
├── app/                           # Web application files
│
├── requirements.txt               # Project dependencies
│
└── README.md                      # Project documentation
Getting Started
Prerequisites
- Python 3.7+
- pip
Installation
- Clone the repository:
git clone https://github.com/yourusername/DupliFinder.git
cd DupliFinder
- Install dependencies:
pip install -r requirements.txt
- Download the dataset from Kaggle and place it in the data/ directory:
mkdir -p data
- Run the notebooks or scripts:
jupyter notebook notebooks/1_EDA.ipynb
Demo
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Kaggle for hosting the original competition
- Quora for providing the dataset
- The open-source community for their invaluable tools and libraries
Contact
If you have any questions or suggestions, feel free to reach out:
- GitHub: your-username
- LinkedIn: your-linkedin
Star this repository if you find it useful!
