malteos
Research engineer: Datasets, information retrieval, representation learning, LLMs, scientific & legal document processing
Top Repositories
A curated list of resources on document similarity measures (papers, tutorials, code, ...)
Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
Implementation, trained models and result data for the paper "Aspect-based Document Similarity for Research Papers" #COLING2020
Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal Literature Recommendations"
Repositories
A collection of resources for PII detection, anonymization, privacy-preserving techniques, and GDPR compliance in Large Language Model (LLM) or AI applications.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
A curated list of resources on document similarity measures (papers, tutorials, code, ...)
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
A curated collection of resources for prompt engineering, optimization, and automatic prompt generation across text, image, video, and multimodal AI systems.
A set of reusable Java components that implement functionality common to any web crawler
COLM's official website repository
Statistics of Common Crawl monthly archives mined from URL index files
Multilingual Sentence & Image Embeddings with BERT
A collection of German GPT language models
NeMo: a toolkit for conversational AI
Implementation, trained models and result data for the paper "Aspect-based Document Similarity for Research Papers" #COLING2020
Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
Leaflet.Sim is a framework for location-based simulations with Leaflet maps that can visualise moving markers, which can change their style, and events over time on a map.
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Wurzel is an open-source Python framework for advanced ETL pipelines in Retrieval-Augmented Generation (RAG) systems.
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal Literature Recommendations"
News visualization with Elasticsearch and Kibana, including NER, sentiment analysis, and geolocations.
Import a Wikipedia XML Dump from HDFS to Lucene index or Elasticsearch and retrieve similar Wikipedia articles based on Lucene's MoreLikeThis query.
Investigating Gender Bias in Turkish Language Models
MTEB: Massive Text Embedding Benchmark
Open source codebase powering the HuggingChat app
Survey data and Python code for the ICADL 2021 paper "A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles"
Distributed trainer for LLMs