malteos

Research engineer: Datasets, information retrieval, representation learning, LLMs, scientific & legal document processing

@commoncrawl

Berlin, Germany

Languages

Jupyter Notebook 38% · Python 31% · Java 13% · Dockerfile 6% · JavaScript 6% · TypeScript 6%

Top Repositories

awesome-document-similarity

A curated list of resources on document similarity measures (papers, tutorials, code, ...)

256

pytorch-bert-document-classification

Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)

160 · Jupyter Notebook

scincl

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)

76 · Python

llm-datasets

A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.

64 · Python

aspect-document-similarity

Implementation, trained models and result data for the paper "Aspect-based Document Similarity for Research Papers" #COLING2020

63 · Jupyter Notebook

legal-document-similarity

Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal Literature Recommendations"

32 · Jupyter Notebook

Repositories

malteos/awesome-anonymization-for-llms

A collection of resources for PII detection, anonymization, privacy-preserving techniques, and GDPR compliance in Large Language Model (LLM) or AI applications.

9 · 0 · Updated 5 days ago

malteos/scincl

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)

Python · 76 · 1 · Updated 1 week ago

emnlp2022 · scidocs · scincl

malteos/llm-datasets

A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.

Python · 64 · 6 · Updated 4 weeks ago

datasets · language-models · llm

malteos/awesome-document-similarity

A curated list of resources on document similarity measures (papers, tutorials, code, ...)

256 · 24 · Updated 4 weeks ago

malteos/cdx_toolkit (Fork)

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

0 · 0 · Updated 2 months ago

malteos/awesome-prompt-optimization

A curated collection of resources for prompt engineering, optimization, and automatic prompt generation across text, image, video, and multimodal AI systems.

123 · Updated 3 months ago

malteos/crawler-commons (Fork)

A set of reusable Java components that implement functionality common to any web crawler

0 · 0 · Updated 4 months ago

malteos/colm-org.github.io (Fork)

COLM's official website repository

0 · 0 · Updated 4 months ago

malteos/cc-crawl-statistics (Fork)

Statistics of Common Crawl monthly archives mined from URL index files

0 · 0 · Updated 4 months ago

malteos/sentence-transformers (Fork)

Multilingual Sentence & Image Embeddings with BERT

0 · 0 · Updated 4 months ago

malteos/german-language-models

A collection of German GPT language models

111 · Updated 5 months ago

malteos/NeMo (Fork)

NeMo: a toolkit for conversational AI

1 · 0 · Updated 6 months ago

malteos/getting-started

No description provided.

Dockerfile · 4 · 2 · Updated 6 months ago

malteos/aspect-document-similarity

Implementation, trained models and result data for the paper "Aspect-based Document Similarity for Research Papers" #COLING2020

Jupyter Notebook · 63 · 8 · Updated 7 months ago

malteos/pytorch-bert-document-classification

Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)

Jupyter Notebook · 160 · 23 · Updated 7 months ago

malteos/Leaflet.Sim (Archived)

Leaflet.Sim is a framework for location-based simulations with Leaflet maps that can visualise moving markers, which can change their style, and events over time on a map.

JavaScript · 2 · 0 · Updated 9 months ago

malteos/semantic-document-relations

Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"

Python · 312 · Updated 10 months ago

bert · document · document-classification · pytorch · similarity · transformer · wikipedia · xlnet

malteos/wurzel (Fork)

Wurzel is an open-source Python framework for advanced ETL pipelines in Retrieval-Augmented Generation (RAG) systems.

0 · 0 · Updated 11 months ago

malteos/clp-transfer

Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Python · 30 · 2 · Updated 11 months ago

malteos/legal-document-similarity

Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal Literature Recommendations"

Jupyter Notebook · 32 · 10 · Updated 12 months ago

malteos/news-visualization (Archived)

News visualization with Elasticsearch and Kibana, including NER, sentiment analysis, and geolocations.

Java · 1 · 0 · Updated 1 year ago

malteos/Wikipedia2Lucene (Archived)

Import a Wikipedia XML Dump from HDFS to Lucene index or Elasticsearch and retrieve similar Wikipedia articles based on Lucene's MoreLikeThis query.

Java · 1 · 0 · Updated 1 year ago

malteos/finetune-evaluation-harness

No description provided.

Python · 2 · 0 · Updated 1 year ago

malteos/turkish-lm-bias

Investigating Gender Bias in Turkish Language Models

Jupyter Notebook · 1 · 0 · Updated 1 year ago

malteos/arqmath

No description provided.

Jupyter Notebook · 0 · 0 · Updated 1 year ago

malteos/mteb (Fork)

MTEB: Massive Text Embedding Benchmark

0 · 0 · Updated 1 year ago

malteos/chat-ui (Fork)

Open source codebase powering the HuggingChat app

TypeScript · 1 · 0 · Updated 1 year ago

malteos/documentation (Fork)

No description provided.

0 · 0 · Updated 1 year ago

malteos/wikipedia-article-recommendations

Survey data and Python code for the ICADL 2021 paper "A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles"

Jupyter Notebook · 5 · 0 · Updated 2 years ago

citolytics · cpa · document-similarity · morelikethis · qualitative-evaluation · recommender-systems · wikipedia

malteos/Megatron-LLM (Fork)

distributed trainer for LLMs

0 · 0 · Updated 2 years ago
