angelosalatino/awesome-scholarly-data-analysis
A curated collection of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources.
Awesome Scholarly Data Analysis
List of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources.
Table of Contents
- Awesome Scholarly Data Analysis
- Table of Contents
- Datasets
- Tools
- Publication Venues
- Summer Schools
- Associations & Community
- Contributions
Table of contents generated with markdown-toc
Datasets
Publication and Citation
- Arnet Miner
- Microsoft Academic Graph
- Open Academic Graph - MAG + AMiner
- Semantic Scholar Corpus
- CiteSeer
- PubMed
- CORA datasets
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triples
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI DOI-to-DOI citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
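Several of the citation datasets above are distributed as tables of citing/cited identifier pairs. As a minimal sketch of how such a table can be turned into per-paper citation counts, the example below reads a small in-memory CSV. The column names `citing` and `cited`, the sample DOIs, and the helper `count_citations` are assumptions made for illustration; check the actual layout of the dump you download before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in a DOI-to-DOI layout. The column names "citing" and
# "cited" and the DOIs themselves are illustrative assumptions.
SAMPLE_CSV = """citing,cited
10.1000/a,10.1000/c
10.1000/b,10.1000/c
10.1000/c,10.1000/d
10.1000/a,10.1000/d
10.1000/a,10.1000/c
"""


def count_citations(csv_file):
    """Count, for each cited DOI, the number of distinct citing DOIs."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # DOIs are case-insensitive, so normalize before comparing.
        pair = (row["citing"].strip().lower(), row["cited"].strip().lower())
        if pair in seen:
            continue  # skip duplicate citation links
        seen.add(pair)
        counts[pair[1]] += 1
    return counts


counts = count_citations(io.StringIO(SAMPLE_CSV))
for doi, n in counts.most_common():
    print(doi, n)
```

For a real dump, pass an open file handle instead of `io.StringIO`. The `seen` set keeps every pair in memory, so very large files may need a streaming or database-backed approach.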
Academic Genealogy
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy Text Format
Author Profiles
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
Author name disambiguation
- INSPIRE dataset
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- DBLP Name disambiguation dataset
- rexa-coref-data
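A common first step when working with disambiguation datasets like these is blocking: grouping name strings that could plausibly refer to the same person before any finer comparison. The sketch below uses a "last name plus first initial" key. The functions `blocking_key` and `group_by_block`, the sample names, and the assumption that names are written given-name-first are all illustrative; none of them comes from the datasets listed above.

```python
import unicodedata
from collections import defaultdict


def blocking_key(name):
    """Map 'First Middle Last' to 'last_f' (lowercased, accents stripped).

    Assumes given-name-first order; 'Last, First' strings are not handled.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    parts = ascii_name.lower().replace(".", " ").split()
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    return f"{parts[-1]}_{parts[0][0]}"


def group_by_block(names):
    """Group name strings that share the same blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return dict(blocks)


# Hypothetical author name strings.
names = ["Jane Q. Doe", "J. Doe", "Jané Doe", "John Smith"]
blocks = group_by_block(names)
for key, members in blocks.items():
    print(key, members)
```

Blocking only narrows the candidate set: names in the same block still need to be compared on features such as coauthors, affiliations, or venues to decide whether they belong to the same author.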
Thesis datasets
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
Information Extraction and NLP
- Citation Parsing
- Document Summarization
- Keyphrase Extraction
- Related Work Summarization
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 also at @CLARIN
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- PubMed Fulltext - protein-protein and genetic interactions
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - SciSpacy link
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- PubMed - Colorado Richly Annotated Full-Text
- Biomedical NER datasets related publication
- BioVerbNet
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
- CLEF datasets for multilingual Biomedical NLP+IE
- MedMentions - UMLS entities in PubMed
Networks
Taxonomies and Ontologies of Research Concepts
- SciGraph Springer Nature
- Medical Subject Headings maintained by the National Library of Medicine of the United States
- Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
- Physics Subject Headings maintained by American Physical Society (APS)
- Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
- ACM Computing Classification System maintained by the Association for Computing Machinery
- Physics and Astronomy Classification Scheme (PACS) maintained by the American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
- Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
- Journal of Economic Literature (JEL) maintained by the American Economic Association
- STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
- Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics; it consists of 3 sub-classification schemes:
  - Fields of Research (FoR) classification
  - Research Fields, Courses and Disciplines (RFCD) classification
  - Socio-Economic Objective (SEO) classification
- Library of Congress Classification (LCC) maintained by Library of Congress
- Fields of Study (FoS) maintained by Microsoft Academic
Affiliations
Altmetrics and Dimensions
Tools
User interface to publication datasets & analysis
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- AceMap
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rcrossref (R library)
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (a web-based tool for bibliometric analysis in scientific literature)
- scihub.py (Python library)
- SoPaper (Python library)
- CiteSeer tools
- Novelty quantification in PubMed articles
Tools for collecting open access papers
- ContentMine - getpapers
- rcoreoa - CORE API R client
Tools for classifying research papers
Visualizations
Language Processing and Information Extraction
- Biomedical - BioSentVec Embeddings
- Biomedical embeddings - CambridgeLTL
- NIH scientific paper pre-processing
- SciSpacy - Spacy models for Biomedical NLP from AllenAI
Citation and metadata extraction
Publication Venues
Journals
- Frontiers in Research Metrics and Analytics
- Scientometrics
- Journal of Informetrics
- Quantitative Science Studies (Open Access)
- Science, Technology, & Human Values
- Social Studies of Science
- Science and Public Policy
Conferences
- Joint Conference on Digital Libraries (JCDL)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology Indicators, e.g., 2018)
- ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
Workshops
- SIGMET - Metrics workshop
- International Workshop on Mining Scientific Publications
- Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)
- Workshop on Reframing Research (RefResh)
- Enabling Open Semantic Science (SemSci)
Summer Schools
Associations & Community
- International Society for Scientometrics and Informetrics (ISSI)
- European Network of Indicator Designers (ENID)
- 4S (Society for Social Studies of Science)
Contributions
The following people have contributed to the items on this list.
- Shubhanshu Mishra - Maintainer of the list.
- Angelo Antonio Salatino
- Philipp Zumstein
- Ali (Aliakbar Akbaritabar)
