
app1606/fakenews_bot

Fake news bot

Install

Dependencies:

pip install -r requirements.txt

  • torch==2.1.1 (PyTorch)
  • numpy==1.23.5
  • transformers==4.35.2 for HuggingFace models
  • datasets==2.15.0
  • wandb==0.16.0 for optional logging
  • peft==0.6.2 for parameter-efficient fine-tuning
  • tqdm==4.66.1 for progress bars
  • python-telegram-bot==20.7 for the Telegram bot
  • pandas==1.5.3, seaborn==0.12.2, matplotlib==3.7.1, wordcloud==1.9.2, regex==2023.6.3, mdutils==1.6.0 for data processing and visualization

Quick start

If you just want to try our project in action, click on the link. You will be greeted by a bot with the following commands:

  • start -- get the bot description
  • help -- get descriptions of the commands
  • generate -- generate a headline. You will be offered a list of topics; pick one and you will receive the latest non-existent news. Be careful: sometimes the headlines are scarily plausible!

Baselines

We use GPT2LMHeadModel from transformers as the baseline model for both English and Russian language modeling, but with different initial weights: Russian-language weights from Sber-AI, and the pre-trained GPT-2 from HuggingFace for English.
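As a hedged illustration, loading the two baselines might look like the sketch below. The English checkpoint name `gpt2` is the standard HuggingFace one; the Russian checkpoint name is a guess at a public Sber-AI GPT-2 checkpoint and may differ from the weights actually used in this project:

```python
# Checkpoint names: "gpt2" is the standard HuggingFace English GPT-2;
# the Russian name below is an assumption (a public Sber-AI checkpoint)
# and may differ from the weights the project actually uses.
CHECKPOINTS = {
    "en": "gpt2",
    "ru": "sberbank-ai/rugpt3small_based_on_gpt2",
}

def load_baseline(lang):
    """Load tokenizer and GPT2LMHeadModel for 'en' or 'ru'."""
    from transformers import AutoTokenizer, GPT2LMHeadModel  # deferred import

    name = CHECKPOINTS[lang]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = GPT2LMHeadModel.from_pretrained(name)
    return tokenizer, model
```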

Efficiency notes

The peft library was used to speed up the training process. The parameters were taken from the paper LoRA: Low-Rank Adaptation of Large Language Models and are shown here:

from peft import LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=4,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],
    lora_dropout=0.1,
)

Using this library allowed us to reduce the training time for one epoch on the News Category Dataset from 2 hours to just 45 minutes.
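The config above only declares the adapter; a minimal sketch of how it is typically applied, assuming the standard peft workflow (`get_peft_model` and `print_trainable_parameters` are the real peft API, while the helper name here is ours):

```python
def wrap_with_lora(model, config):
    """Wrap a causal-LM so only the LoRA adapter weights are trainable."""
    from peft import get_peft_model  # deferred import

    peft_model = get_peft_model(model, config)
    # Prints trainable vs. total parameter counts; with r=4 the trainable
    # share is a small fraction of the full model.
    peft_model.print_trainable_parameters()
    return peft_model
```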

Experiments

We compared run statistics across different PEFT model trainings with a small rank parameter of r=4. The runtime was the same since the PEFT model was the same, but it was a significant improvement over full fine-tuning. We also see that the loss for the Russian language model is lower.

Experiment 4 was conducted on the BBC dataset, a smaller one, while the News Category dataset was used for the third experiment. There is no significant difference in loss between the two graphs; one is simply shorter because of the dataset size.

[Figure: training loss curves for the experiments described above]

Sampling / Inference

Inference example can be found in the Inference section of notebooks/Fake_News_Generator.ipynb.

First, load the PEFT weights and merge them into the base model:

from peft import PeftModel

model_to_merge = PeftModel.from_pretrained(model, path_to_peft_weights)
merged_model = model_to_merge.merge_and_unload()

Then one can use the generate_headline function (defined in the notebook), which is based on model.generate(), or write their own version.
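The notebook's generate_headline is not reproduced here; the hypothetical sketch below only illustrates the typical model.generate() sampling call it is based on (the prompt format and sampling parameters are assumptions):

```python
def generate_headline(model, tokenizer, topic, max_new_tokens=30):
    """Sample one headline for the given topic (illustrative sketch)."""
    import torch  # deferred import

    prompt = f"{topic}: "  # the actual prompt format is an assumption
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```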

Telegram bot

To launch your own Telegram bot, you will need to obtain a bot token. The process is described in detail in this tutorial.
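For reference, a minimal sketch of wiring the three commands with python-telegram-bot 20.x (the reply texts and topic list are illustrative assumptions; the real logic lives in bot.py):

```python
# Hypothetical topic list for illustration; the real bot's topics may differ.
TOPICS = ["politics", "science", "sports"]

def main(token):
    """Run a minimal bot with /start, /help, and /generate commands."""
    # Deferred imports so the sketch can be read without the library.
    from telegram import Update
    from telegram.ext import ApplicationBuilder, CommandHandler, ContextTypes

    async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
        await update.message.reply_text(
            "I generate plausible-looking fake news headlines."
        )

    async def help_cmd(update: Update, context: ContextTypes.DEFAULT_TYPE):
        await update.message.reply_text(
            "/start - bot description\n"
            "/help - this message\n"
            "/generate - pick a topic and get a headline"
        )

    async def generate(update: Update, context: ContextTypes.DEFAULT_TYPE):
        # In the real bot the headline comes from the fine-tuned model.
        await update.message.reply_text("Topics: " + ", ".join(TOPICS))

    app = ApplicationBuilder().token(token).build()
    app.add_handler(CommandHandler("start", start))
    app.add_handler(CommandHandler("help", help_cmd))
    app.add_handler(CommandHandler("generate", generate))
    app.run_polling()  # blocks; the deployed bot runs this on the server
```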

Deployment

We use the Kamatera portal for our server. The bot works asynchronously, and the code can be found in bot.py.

Todos

  • Explore other embeddings
  • Experiment with optimizations and hyperparameters
  • Additional logging around network
  • Organize code into python scripts

Acknowledgements

Many thanks to Panorama News Agency & Publishing House for providing the dataset and Serge Kim for the guidance throughout the project.



MIT License
Created November 8, 2023
Updated December 5, 2023