RPMrepo Snapshots
RPM Repository Snapshot Management
The RPMrepo project creates persistent and immutable snapshots of RPM
repositories. It provides tools and infrastructure to create such snapshots, as
well as host and serve them.
Project
- Website: https://www.osbuild.org
- Bug Tracker: https://github.com/osbuild/rpmrepo/issues
Requirements
The requirements for this project are:
python >= 3.8
About
RPMrepo consists of a set of utilities and infrastructure. The
main goal is to regularly create immutable snapshots of a set of public and
Red Hat-private RPM repositories and provide them for a fixed amount of time
on our own infrastructure.
Our infrastructure is maintained via the OSBuild Terraform configuration. See
the image-builder-terraform repository, in particular the rpmrepo.tf
configuration.
For user documentation on RPMrepo, see:
https://osbuild.org/docs/developer-guide/projects/rpmrepo/
The backend implementation of RPMrepo involves the following steps:
- Target Repository Configuration
When running snapshot operations, we need to know the list of target RPM
repositories to snapshot, what to call the snapshots, and where to store
the data. This information is currently stored as JSON files in this
repository (see the `./repo/` subdirectory).

For each target repository, we store a JSON dictionary with the following
information:

  - "base-url": The RPM repository base-url to create the snapshot of. An
    RPM repository base-url requires the root-level metadata
    file to be accessible as `repodata/repomd.xml`. See the
    DNF / RPM documentation for more information, if desired.
  - "platform-id": The DNF Platform ID to use. This allows grouping
    multiple snapshots together so that they share the backend
    storage. We use this to deduplicate RPMs in our backend.
    This ID can be freely chosen, but all snapshots that
    share an ID can only be deleted together. We usually
    pick the actual DNF Platform ID (see the DNF
    `module_platform_id` for details) here, but this is
    not required.
  - "singleton": We usually create snapshots regularly. In case a target
    repository is already immutable by design, this key can
    be set to make sure only a single snapshot of this
    repository is ever taken. Simply set this key to the
    snapshot suffix to use for the singleton snapshot, and
    all snapshot operations will use this suffix (and thus
    skip the operation if the snapshot already exists).
  - "snapshot-id": The name of the snapshot to store it as. Usually this
    is the same as the name of this file without extension.
    This name can be freely chosen. We usually name
    snapshots as:

    ```
    <platform-id>-<arch>-<repo>[-<repo-version>]
    ```

    Note that the actual snapshots will get a suffix like `-<date>`
    appended automatically. This field must not include that suffix.
  - "storage": The ID of the storage location to use. We have different
    storage locations for different access rights. For now, this
    is just a string that specifies the directory in our backend
    storage. See the backend information for possible values.
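
As an illustration of the fields above, the following minimal sketch builds one such dictionary in Python, checks that the documented keys are present, and derives the final snapshot name. The example values, the helper names `validate_config` and `snapshot_name`, and the assumption that every key except "singleton" is mandatory are hypothetical and not taken from the RPMrepo sources.

```python
import json

# Keys described above. Treating all of them except "singleton" as
# mandatory is an assumption made for this sketch.
REQUIRED_KEYS = {"base-url", "platform-id", "snapshot-id", "storage"}
OPTIONAL_KEYS = {"singleton"}


def validate_config(config):
    """Raise ValueError if documented keys are missing or unknown keys appear."""
    keys = set(config)
    missing = REQUIRED_KEYS - keys
    unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")


def snapshot_name(config, suffix):
    """Return '<snapshot-id>-<suffix>', preferring the singleton suffix if set."""
    effective = config.get("singleton") or suffix
    return f"{config['snapshot-id']}-{effective}"


# Hypothetical example values, for illustration only.
example = json.loads("""
{
  "base-url": "https://example.com/repos/f40/x86_64/os/",
  "platform-id": "f40",
  "snapshot-id": "f40-x86_64-fedora",
  "storage": "public"
}
""")

validate_config(example)
print(snapshot_name(example, "20240101"))
```

With a "singleton" key set, `snapshot_name` ignores the requested suffix, which mirrors the rule that only one snapshot of such a repository is ever taken.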
- Snapshot Creation
To create snapshots, we use the `reposync` dnf module. See `dnf reposync`
for more information. This tool just downloads an entire RPM repository
to local storage. We then index this data for our backend storage and
upload it.

The `./src/ctl/` directory implements the command-line control client
that we use for this. It is a Python module that wraps `dnf reposync`
to download a repository, provides indexing helpers, and then wraps the
AWS `boto3` API to upload everything to our storage.

Note that a single snapshot might temporarily store up to 100GiB of data
and can take up to 8h. Therefore, none of the default script execution
engines can be used, since they either have limited disk space or limited
execution time.

We provide a container (see the `osbuild/containers` repository) called
`rpmrepo-snapshot` which reads the configuration in `./repo/` and uses the
Python module in `./src/ctl/` to create a snapshot. The container supports
batched execution and can thus be used to create many snapshots in parallel.
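
The indexing step can be pictured as follows. This is only a sketch of the idea, not the code in `./src/ctl/`: the function names `sha256_of_file` and `index_tree` and the shape of the returned mapping are assumptions. It walks a directory that `dnf reposync` has already populated and records the sha256 checksum of every file, keyed by its path relative to the repository root.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large RPMs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map every file below `root` (relative POSIX path) to its sha256 checksum."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            index[rel] = sha256_of_file(full)
    return index
```

Hashing in chunks matters here because a snapshot can hold many large packages; reading each file fully into memory would be wasteful.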
- Storage
We currently store all snapshots in a dedicated AWS S3 bucket called
`rpmrepo-storage`. Since we store a lot of data, we employ a data
deduplication strategy. All actual data files are stored with their sha256
checksum as name in `data/<storage>/<platform-id>/sha256-<checksum>`. This
means matching files will be deduplicated if they are stored in the same
storage directory with the same platform-id.

Since we dropped file names and paths, we cannot serve an RPM repository
from this checksum-based storage. Therefore, we create shim wrapping layers
that refer to this storage. In `data/ref/<platform-id>/<snapshot-id>/...`
we store the entire RPM repository, but with empty files. We then attach
AWS S3 metadata to all these empty files and fill in the checksum of their
content. This way, all objects underneath the `data/ref/...` directory are
empty, and thus free of charge.

Our frontend thus only needs to redirect requests from `data/ref/` to
the correct underlying file, by reading the checksum metadata.
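
To make the two layouts concrete, the sketch below computes the object keys that would be written for one file of a snapshot. The key patterns follow the description above. The function names and the metadata field names `storage` and `checksum` are hypothetical, since the text does not say how the metadata is named.

```python
import hashlib


def data_key(storage, platform_id, checksum):
    """Key of the deduplicated content object."""
    return f"data/{storage}/{platform_id}/sha256-{checksum}"


def ref_key(platform_id, snapshot_id, path):
    """Key of the empty reference object that mirrors the repository path."""
    return f"data/ref/{platform_id}/{snapshot_id}/{path}"


def plan_upload(storage, platform_id, snapshot_id, path, content):
    """Describe the two objects that would be written for one repository file."""
    checksum = hashlib.sha256(content).hexdigest()
    return {
        "data": {"key": data_key(storage, platform_id, checksum), "body": content},
        "ref": {
            "key": ref_key(platform_id, snapshot_id, path),
            "body": b"",  # reference objects stay empty
            # Hypothetical metadata field names; the real ones are not given above.
            "metadata": {"storage": storage, "checksum": checksum},
        },
    }
```

Two snapshots that contain the same file under the same storage and platform-id produce the same data key but different reference keys, which is the deduplication effect described above.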
- Gateway
The frontend to the RPM repository snapshots is a simple HTTP REST API. It
uses AWS API Gateway to create a catch-all REST API that forwards
all HTTP requests to an AWS Lambda script. This script is sourced from
`./src/gateway/` in this repository.

This script provides a multitude of legacy interfaces for all kinds of
operations. See its implementation for details. Its main job is to read
requests to a snapshot, find the file in `data/ref/...`, read the checksum
from the metadata of this empty file, and then return a 301 HTTP redirect
to the right file in `data/<storage>/<platform-id>/sha256-<checksum>`. It
is then the client's job to follow this redirect and directly download the
file.

Note that we simply redirect clients to the public HTTP interface of AWS
S3. The gateway never transmits any data of the repositories. This keeps
our charges low and makes sure large files are always directly transferred
between AWS S3 and the client.

Several paths in the `rpmrepo-storage` S3 bucket are publicly accessible,
in particular `data/public/`, `data/ref/`, and `data/thread/`. The
`data/rhvpn/` path is NOT publicly accessible. Instead, we have an AWS
VPC Endpoint that opens up this path to all clients from within the RH
VPN. Hence, data stored in this directory is only accessible from within
RH.

Note that `data/ref/` is public, and as such all snapshots can be listed
and enumerated publicly. Only the file content is possibly protected from
public access. This is intentional, but can be changed in the future if
it poses a problem.

Apart from redirects, the gateway also provides utility functions to
enumerate all snapshots, or redirect to old legacy storage locations of
older RPMrepo revisions.
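
The redirect logic can be sketched as below. This is not the script in `./src/gateway/`: the bucket URL, the metadata field names `storage` and `checksum`, and the `lookup_metadata` callable (a stand-in for reading the S3 object metadata) are all assumptions made for illustration. The returned dictionaries use the response shape of an AWS Lambda proxy integration.

```python
# Assumed public endpoint of the bucket; the real URL may differ.
BUCKET_URL = "https://rpmrepo-storage.s3.amazonaws.com"


def redirect_response(storage, platform_id, checksum):
    """Build a 301 response pointing at the checksum-addressed object."""
    location = f"{BUCKET_URL}/data/{storage}/{platform_id}/sha256-{checksum}"
    return {"statusCode": 301, "headers": {"Location": location}, "body": ""}


def handle_request(platform_id, snapshot_id, path, lookup_metadata):
    """Resolve a snapshot path via its empty reference object.

    `lookup_metadata` is a hypothetical callable that returns the metadata
    dictionary of a reference key, or None if the key does not exist.
    """
    ref = f"data/ref/{platform_id}/{snapshot_id}/{path}"
    meta = lookup_metadata(ref)
    if meta is None:
        return {"statusCode": 404, "headers": {}, "body": "not found"}
    return redirect_response(meta["storage"], platform_id, meta["checksum"])
```

Because the handler only returns a `Location` header, no repository data passes through the gateway, which matches the cost argument made above.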
- Snapshot Routine
As a single snapshot operation requires a lot of storage and time, we use
custom infrastructure to run this. This used to be Beaker, but for better
reliability, we now schedule the snapshot jobs on AWS Batch. The
previously mentioned `rpmrepo-snapshot` container is scheduled on AWS Batch
and then creates the requested snapshots.

The snapshot routine can be scheduled as a single job, or as an array job.
If scheduled as a single job, you must specify the name of the target
configuration in `./repo/` to run. If scheduled as an array job, you should
size the array as big as the number of files in `./repo/` (bigger is fine,
those excess jobs will be no-ops; smaller is less fine, as it will miss
snapshots). The array jobs will then each pick one file in `./repo/` based
on their ARRAY-JOB-ID.

Furthermore, the snapshot routine requires you to specify the branch and
commit of the `rpmrepo` repository to use. You can use `main` + `HEAD`, but
this will be subject to concurrent changes in the upstream repository. You
are strongly advised to use `main` + `<commit-sha>` instead.

Lastly, you can specify the suffix to be used for the snapshots. If you
specify `auto`, it will use the current date and time (except for singleton
snapshots; see above). You should specify this suffix manually to make
sure all snapshots share a suffix. Otherwise, updating users will be a
hassle.

The AWS Batch interface allows you to track all the snapshot jobs, see
which failed, and reschedule individual jobs, if desired.
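
The way an array job maps its index to one configuration file can be sketched as follows. The helper names and the choice to sort the file names alphabetically are assumptions, not the behaviour of the actual container. `AWS_BATCH_JOB_ARRAY_INDEX` is the environment variable that AWS Batch sets for the children of an array job.

```python
import os


def pick_config(repo_dir, array_index):
    """Return the config file for this array index, or None for excess jobs."""
    # Sorting gives every job the same stable view of the directory.
    files = sorted(f for f in os.listdir(repo_dir) if f.endswith(".json"))
    if not 0 <= array_index < len(files):
        return None  # excess array job: nothing to do
    return os.path.join(repo_dir, files[array_index])


def array_index_from_env(environ=os.environ):
    """Read the array index that AWS Batch passes to each child job."""
    return int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
```

An array that is larger than the number of files simply yields `None` for the surplus indices, while a smaller array never reaches the files at the end of the list.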
- Updating snapshot configurations
The `./repo/` directory contains all the configuration files for the
snapshots. Each file is a JSON file that specifies the configuration for
one snapshot.

Individual snapshot configurations can be generated using the helper script
`./gen-repos.py`. Multiple snapshot configurations can be generated by
defining them in `./repo-definitions.yaml` and then running
`./gen-all-repos.py`, which internally calls `./gen-repos.py`.

Updating the snapshot configurations usually consists of deleting unused
configurations, and adding new ones. The most convenient way to do this is
to update the `./repo-definitions.yaml` file, and then run the
`snapshot-configs` Makefile target. This will automatically delete all
configurations that are no longer present in the definition file, and
generate new ones.
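
The effect of such a synchronization can be pictured as a set difference between the snapshot IDs that are defined and the files that already exist. The sketch below is not the logic of the Makefile target or of `./gen-all-repos.py`; the function name `plan_sync` and the assumption that each file is named `<snapshot-id>.json` are made up for illustration.

```python
def plan_sync(defined_ids, existing_files):
    """Return (stale, new): files to delete and files still to be generated."""
    # Assumes one file per snapshot ID, named '<snapshot-id>.json'.
    wanted = {f"{snapshot_id}.json" for snapshot_id in defined_ids}
    existing = set(existing_files)
    stale = sorted(existing - wanted)
    new = sorted(wanted - existing)
    return stale, new


stale, new = plan_sync({"a", "b"}, {"b.json", "c.json"})
print("delete:", stale)
print("generate:", new)
```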
List Available Snapshots
If you just need a list of the available snapshots you can query the API like
this:
curl https://rpmrepo.osbuild.org/v2/enumerate | jq .
This returns a JSON list of the snapshot names.
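
The same query can be made from Python using only the standard library. This is a minimal sketch: the helper names and the prefix filter are assumptions added for illustration, and the code relies on the endpoint returning a JSON list of names as described above.

```python
import json
import urllib.request

ENUMERATE_URL = "https://rpmrepo.osbuild.org/v2/enumerate"


def fetch_snapshots(url=ENUMERATE_URL, timeout=30):
    """Download and decode the JSON list of snapshot names."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def filter_snapshots(names, prefix):
    """Keep only snapshot names that start with `prefix`, sorted."""
    return sorted(n for n in names if n.startswith(prefix))


# Example usage (performs a network request; the prefix is hypothetical):
#   for name in filter_snapshots(fetch_snapshots(), "f40-"):
#       print(name)
```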
Repository:
- web: https://github.com/osbuild/rpmrepo
- https: https://github.com/osbuild/rpmrepo.git
- ssh: git@github.com:osbuild/rpmrepo.git
License:
- Apache-2.0
- See LICENSE file for details.