ram02z/grobid

grobid

Python library for serializing GROBID TEI XML to dataclasses

Installation

Use pip to install:

$ pip install grobid
$ pip install grobid[json] # for JSON serializable dataclass objects

You can also download the .whl file from the release section:

$ pip install *.whl

Usage

Client

To convert an academic PDF to a TEI XML file, we use GROBID's REST
services, specifically the processFulltextDocument endpoint.

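from pathlib import Path
# Assumed import path: the original snippet uses these two names without importing them.
from grobid.client import Client, GrobidClientError
from grobid.models.form import Form, File
from grobid.models.response import Response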

pdf_file = Path("<your-academic-article>.pdf")
with open(pdf_file, "rb") as file:
    form = Form(
        file=File(
            payload=file.read(),
            file_name=pdf_file.name,
            mime_type="application/pdf",
        )
    )
    c = Client(base_url="<base-url>", form=form)
    try:
        xml_content = c.sync_request().content  # TEI XML file in bytes
    except GrobidClientError as e:
        print(e)

Here, base_url is the URL of the GROBID REST service.

You can use https://cloud.science-miner.com/grobid/ to test.
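
For readers curious about what such a request looks like on the wire, here is a minimal standard-library sketch. It is not how this library's Client is implemented. It assumes GROBID's documented REST conventions: the endpoint path api/processFulltextDocument, the PDF sent in a multipart field named input, and optional parameters such as consolidateHeader sent as plain form fields. The helper names fulltext_endpoint and encode_multipart are hypothetical, and the request is built but not sent, so the sketch runs offline.

```python
import urllib.request
import uuid
from urllib.parse import urljoin


def fulltext_endpoint(base_url: str) -> str:
    """Join the service base URL with GROBID's full-text endpoint path."""
    # Ensure exactly one trailing slash so urljoin keeps the last path segment.
    return urljoin(base_url.rstrip("/") + "/", "api/processFulltextDocument")


def encode_multipart(pdf_bytes: bytes, file_name: str, params: dict[str, str]) -> tuple[bytes, str]:
    """Build a multipart/form-data body and return (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    parts = []
    # Optional parameters travel as simple text fields.
    for name, value in params.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The PDF itself goes in the field named "input".
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="input"; filename="{file_name}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode()
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


url = fulltext_endpoint("https://cloud.science-miner.com/grobid/")
body, content_type = encode_multipart(b"%PDF-1.4 placeholder", "paper.pdf", {"consolidateHeader": "1"})
request = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
# urllib.request.urlopen(request) would send it and return the TEI XML bytes.
```

In practice the Client class shown above takes care of all of this; the sketch only illustrates what the Form fields turn into.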

Form

The Form class supports most of the optional parameters of the processFulltextDocument
endpoint.
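
As a rough illustration of what those optional parameters are at the REST level, the sketch below converts Python values into the string form fields that GROBID expects. The parameter names (consolidateHeader, consolidateCitations, includeRawCitations, segmentSentences) come from GROBID's REST documentation; they are not necessarily the attribute names of the Form class, and to_form_fields is a hypothetical helper, not part of this library.

```python
def to_form_fields(options: dict[str, object]) -> dict[str, str]:
    """Convert Python values into the string fields a multipart POST carries."""
    fields: dict[str, str] = {}
    for name, value in options.items():
        if value is None:
            continue  # leave unset so the service applies its own default
        if isinstance(value, bool):
            fields[name] = "1" if value else "0"  # GROBID flags are sent as 0/1
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields(
    {
        "consolidateHeader": True,
        "consolidateCitations": False,
        "includeRawCitations": True,
        "segmentSentences": None,  # omitted from the request
    }
)
```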

Parser

To serialize the XML content, use the Parser class to create dataclass
objects.

Not all of the GROBID annotation guidelines are met yet, but compliance is a goal.
See #1.

from grobid.tei import Parser

xml_content: bytes  # TEI XML, e.g. the bytes returned in the Client section
parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed

Here, xml_content is the same as in the Client section.

Alternatively, you can load the XML from a file:

from grobid.tei import Parser

with open("<your-academic-article>.xml", "rb") as xml_file:
    xml_content = xml_file.read()
    parser = Parser(xml_content)
    article = parser.parse()
    article.to_json()  # raises RuntimeError if the 'json' extra is not installed

The to_json method uses orjson to serialize the dataclasses into JSON.
orjson isn't installed by default; to get it, run
pip install grobid[json].
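
To give a feel for what serializing TEI XML to dataclasses involves, here is a small standard-library sketch. It is not the library's Parser: the Author and Article dataclasses, the parse_header function, and the sample document are all hypothetical, it only reads the title and author names from the TEI header, and it uses the built-in json module as a stand-in for orjson.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field

# TEI documents declare this default namespace, so lookups must be qualified.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass
class Author:
    forename: str
    surname: str


@dataclass
class Article:
    title: str
    authors: list[Author] = field(default_factory=list)


def parse_header(xml_content: bytes) -> Article:
    """Extract the title and author names from a TEI header."""
    root = ET.fromstring(xml_content)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = [
        Author(
            forename=pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            surname=pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        )
        for pers in root.iterfind(".//tei:author/tei:persName", TEI_NS)
    ]
    return Article(title=title.strip(), authors=authors)


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Title</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

article = parse_header(SAMPLE)
as_json = json.dumps(asdict(article))  # nested dataclasses become plain dicts
```

The library's own dataclasses cover far more of the TEI structure; the point here is only the general shape of the XML-to-dataclass-to-JSON pipeline.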

License

MIT

Contributing

You are welcome to add missing features by submitting a PR. However, I won't be
accepting any requests other than GROBID annotation compliance.

Disclaimer

This module was originally part of a group university project; however, all the
code and tests were authored by me.

Created July 23, 2022
Updated August 1, 2025