mayankmittal29/DupliFinder-Quora-Clone-Catcher

Detecting duplicate Quora question pairs using ML, deep learning (LSTM, BERT), advanced NLP, feature engineering, and ensemble methods, optimized for accuracy and scalability.

DupliFinder: Quora Question Pairs Challenge

Problem Statement

Quora is a platform where people ask questions and connect with others who contribute unique insights and quality answers. A key challenge on Quora is quickly identifying duplicate questions, which improves the user experience and keeps content quality high.

This project tackles the Quora Question Pairs challenge from Kaggle, which requires building a machine learning model that decides whether two questions are semantically equivalent (duplicates) or not.

Project Goals

  • Develop models to accurately classify question pairs as duplicates or non-duplicates
  • Experiment with various text preprocessing techniques
  • Compare performance of traditional ML algorithms and deep learning approaches
  • Extract and engineer useful features from text data
  • Optimize model performance through hyperparameter tuning and cross-validation

Dataset Description

The dataset consists of over 400,000 question pairs from Quora, each with the following fields:

  • id: The unique identifier for a question pair
  • qid1, qid2: Unique identifiers for each question (only in train.csv)
  • question1, question2: The full text of each question
  • is_duplicate: The target variable (1 if questions are duplicates, 0 otherwise)
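
As a minimal sketch of the field layout above, the snippet below parses rows in the documented column order using only the standard library. The two sample rows and the `load_pairs` helper are invented for illustration and are not taken from the real dataset; the project itself may well load `train.csv` with Pandas instead.

```python
import csv
import io

# Two invented rows in the documented train.csv column layout (not real data).
SAMPLE_CSV = '''id,qid1,qid2,question1,question2,is_duplicate
0,1,2,"How do I learn Python?","What is the best way to learn Python?",1
1,3,4,"What is machine learning?","How do I cook rice?",0
'''

def load_pairs(handle):
    """Read question pairs into dicts, casting is_duplicate to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["is_duplicate"] = int(row["is_duplicate"])
        rows.append(row)
    return rows

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
# Share of pairs labeled as duplicates in the sample.
duplicate_rate = sum(r["is_duplicate"] for r in pairs) / len(pairs)
```

To read the actual file, the same helper could be passed an open handle such as `open("data/train.csv", newline="", encoding="utf-8")`, assuming the file sits in the `data/` directory.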

Note: The ground-truth labels are subjective and were provided by human experts. While they represent a reasonable consensus, they may not be 100% accurate on a case-by-case basis.

Methodology

1. Data Exploration and Preprocessing

  • Exploratory Data Analysis (EDA)

    • Distribution of duplicate/non-duplicate questions
    • Question length analysis
    • Word frequency analysis
    • Visualization of key features
  • Text Preprocessing

    • Removal of HTML tags and special characters
    • Expanding contractions
    • Tokenization
    • Stopword removal
    • Stemming/Lemmatization
    • Advanced cleaning techniques
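
The preprocessing steps listed above could be chained roughly as follows. This is a simplified sketch using only the standard library: the `CONTRACTIONS` and `STOPWORDS` lookups are tiny illustrative stand-ins (a real pipeline would likely use fuller lists from NLTK or SpaCy), tokenization is plain whitespace splitting, and stemming/lemmatization is left out.

```python
import html
import re

# Small illustrative lookups (assumptions); real lists would be much longer.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "what's": "what is", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "to", "of", "in"}

def preprocess(text):
    """Clean one question and return its list of tokens."""
    text = html.unescape(text).lower()          # decode entities such as &#39;
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove special characters
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal
```

For example, `preprocess("<b>What's</b> the best way to learn Python?")` yields the tokens `what`, `best`, `way`, `learn`, `python`.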

2. Feature Engineering

  • Basic Features

    • Question length
    • Word count
    • Common words between questions
    • Word share ratio
  • Advanced Features

    • Token features (common words, stopwords, etc.)
    • Length-based features
    • Fuzzy matching features (Levenshtein distance, etc.)
    • TF-IDF features
    • Word embedding features
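
A rough sketch of how a few of these features might be computed for one question pair is shown below. The feature names, the word-share definition (shared unique words divided by the sum of both unique-word counts), and the normalized Levenshtein ratio are assumptions made for illustration; the project may define them differently or rely on FuzzyWuzzy for the fuzzy scores.

```python
def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def pair_features(q1, q2):
    """Basic length, overlap, and fuzzy-match features for one pair."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    s1, s2 = set(w1), set(w2)
    common = len(s1 & s2)
    total = len(s1) + len(s2)
    longest = max(len(q1), len(q2))
    distance = levenshtein(q1.lower(), q2.lower())
    return {
        "q1_len": len(q1),
        "q2_len": len(q2),
        "q1_words": len(w1),
        "q2_words": len(w2),
        "common_words": common,
        "word_share": common / total if total else 0.0,
        # 1.0 means identical strings, 0.0 means nothing in common
        "lev_ratio": 1 - distance / longest if longest else 1.0,
    }
```

Calling `pair_features("how to learn python", "how to learn java")` gives three common words and a word share of 0.375 under this definition.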

3. Text Representation Methods

  • Bag of Words (BoW)
  • TF-IDF Vectorization
  • Word Embeddings
    • Word2Vec
    • GloVe
    • FastText
  • Contextual Embeddings
    • BERT
    • RoBERTa
    • DistilBERT
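
To make the sparse representations concrete, here is a small pure-Python sketch of TF-IDF weighting over tokenized questions, with cosine similarity as one possible pairwise feature. The helper names and the smoothed IDF formula are choices made for this example; in practice a library vectorizer such as the one in Scikit-learn would probably be used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Dropping the IDF factor (using raw counts only) turns the same code into a Bag of Words representation.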

4. Machine Learning Models

  • Traditional ML Algorithms

    • Random Forest
    • XGBoost
    • Support Vector Machines (SVM)
    • Logistic Regression
    • Naive Bayes
  • Deep Learning Models

    • LSTM/BiLSTM
    • Siamese Networks
    • Transformer-based models
    • Fine-tuned BERT/RoBERTa

5. Model Optimization

  • Hyperparameter Tuning

    • Grid Search
    • Random Search
    • Bayesian Optimization
  • Cross-Validation

    • K-Fold Cross-Validation
    • Stratified K-Fold Cross-Validation
  • Ensemble Methods

    • Voting
    • Stacking
    • Bagging
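
Two of the ideas above can be sketched without any ML library: stratified K-fold splitting, which keeps the duplicate/non-duplicate ratio roughly equal across folds, and unweighted soft voting, which averages the duplicate probabilities of several models. Both helpers are hypothetical illustrations, not the project's implementation, which would more likely use the Scikit-learn equivalents.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) with each class spread evenly over k folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val

def soft_vote(prob_lists):
    """Average per-model duplicate probabilities (unweighted soft voting)."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]
```

Each element of `prob_lists` is assumed to be one model's predicted probabilities for the same ordered list of question pairs.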

Performance Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
  • Log Loss
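
For reference, the metrics above can be written out from their definitions as in the sketch below. The function names and the 0.5 decision threshold are assumptions for this example; `y_prob` is taken to be the predicted probability that a pair is a duplicate.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1, and log loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15  # clip probabilities so log(0) never occurs
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

def roc_auc(y_true, y_prob):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise ROC-AUC loop is quadratic in the number of examples, so it is only suitable for small checks; `roc_auc` also assumes both classes are present.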

Results

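| Model         | Embedding  | Accuracy | F1 Score | ROC-AUC |
|---------------|------------|----------|----------|---------|
| Random Forest | BoW        | 80.2%    | 0.79     | 0.86    |
| XGBoost       | BoW        | 81.3%    | 0.80     | 0.87    |
| SVM           | TF-IDF     | 82.5%    | 0.81     | 0.88    |
| LSTM          | Word2Vec   | 83.7%    | 0.82     | 0.89    |
| BERT          | Contextual | 87.2%    | 0.86     | 0.92    |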

Note: This table will be updated as more models are implemented and tested.

Future Work

  • Implement more advanced deep learning architectures
  • Experiment with different embedding techniques
  • Explore transfer learning approaches
  • Investigate attention mechanisms
  • Develop an ensemble of best-performing models
  • Build a simple web app for question duplicate detection

Tools and Technologies

  • Programming Language: Python
  • ML Libraries: Scikit-learn, XGBoost, LightGBM
  • DL Libraries: TensorFlow, Keras, PyTorch
  • NLP Libraries: NLTK, SpaCy, Transformers
  • Data Manipulation: NumPy, Pandas
  • Visualization: Matplotlib, Seaborn, Plotly
  • Text Processing: Regex, BeautifulSoup, FuzzyWuzzy

Repository Structure

DupliFinder/
│
├── data/                      # Dataset files
│   ├── train.csv              # Training set
│   └── test.csv               # Test set
│
├── notebooks/                 # Jupyter notebooks
│   ├── 1_EDA.ipynb            # Exploratory Data Analysis
│   ├── 2_Preprocessing.ipynb  # Text preprocessing
│   ├── 3_FeatureEngineering.ipynb # Feature engineering
│   ├── 4_Traditional_ML.ipynb # Traditional ML models
│   └── 5_Deep_Learning.ipynb  # Deep learning models
│
├── src/                       # Source code
│   ├── preprocessing/         # Text preprocessing modules
│   ├── features/              # Feature engineering modules
│   ├── models/                # Model implementations
│   ├── utils/                 # Utility functions
│   └── visualization/         # Visualization functions
│
├── models/                    # Saved model files
│
├── app/                       # Web application files
│
├── requirements.txt           # Project dependencies
│
└── README.md                  # Project documentation

Getting Started

Prerequisites

  • Python 3.7+
  • pip

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/DupliFinder.git
     cd DupliFinder
  2. Install dependencies:
     pip install -r requirements.txt
  3. Download the dataset:
     mkdir -p data
     # Download from Kaggle and place in data/ directory
  4. Run the notebooks or scripts:
     jupyter notebook notebooks/1_EDA.ipynb

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Kaggle for hosting the original competition
  • Quora for providing the dataset
  • The open-source community for their invaluable tools and libraries

Contact

If you have any questions or suggestions, feel free to reach out:


Star this repository if you find it useful!
