GitHunt
BR

brinkmanlab/MicrobeDB

Curated mirror of RefSeq Microbial Genomes. Available via CVMFS repository.

MicrobeDB

How to access

MicrobeDB is distributed using the CERN VM File System (CVMFS). Docker and CSI deployment recipes are available in ./destinations. The recipes are
executed by Terraform.

Docker may fail to unmount CVMFS during shutdown, run sudo fusermount -u ./microbedb/mount if you encounter transport endpoint is not connected
errors.

OSX Peculiarities

OSX does not natively support Docker, it runs Docker within a Linux virtual machine. This workaround means that support is limited to only the most
basic use case. While mounting MicrobeDB via CVMFS, it will fail with an error.

To work around this CVMFS must be installed and configured manually. First ensure that FUSE is enabled by
running kextstat | grep -i fuse. Download the CVMFS package. Install the pkg and
reboot. Copy ../destinations/docker/cvmfs.config to /etc/cvmfs/default.local.
Copy ./microbedb.brinkmanlab.ca.pub to /etc/cvmfs/keys/microbedb.brinkmanlab.ca.pub. Ensure everything is
configured properly by running sudo cvmfs_config chksetup. You MUST mount the CVMFS repository under a shared folder as configured in your
Docker settings for it to be accessible by Docker. By default /tmp should be included as a shared folder and you can mount the repository
to /tmp/microbedb. Ensure /tmp/microbedb exists and run sudo mount -t cvmfs microbedb.brinkmanlab.ca /tmp/microbedb.

Schema documentation

Run sqlite3 microbedb.sqlite '.schema' to view documentation of the various tables and columns. The assembly table is largely undocumented because
NCBI does not document their data schemas.

Working with taxonomy data

Use SQLite recursive query to determine if tax_id is subclass of ancestor. The following returns 1 if the query_tax_id is a subclass of
ancestor_tax_id:

WITH RECURSIVE subClassOf(n) AS (
    VALUES (query_tax_id)
    UNION
    SELECT parent_tax_id
    FROM taxonomy_nodes,
         subClassOf
    WHERE taxonomy_nodes.tax_id = subClassOf.n
      AND taxonomy_nodes.tax_id != ancestor_tax_id
)
SELECT 1
FROM subClassOf
WHERE n = ancestor_tax_id
LIMIT 1;

Build requirements

Ensure the find command supports -empty by running find --help | grep '-empty'. The most recent CVMFS commit of the repository must be mounted
on all compute nodes.
cvmfs_config must be accessible on all compute nodes.

Project Layout

  • destinations/* - terraform modules to deploy a CVMFS client configured with microbedb to various environments
  • update.sh - Script to sync data with NCBI for a CVMFS server
  • init_env.sh - Script to install dependencies for update.sh
  • fetch.sh - Executed by update.sh per chunk of datasets returned by Entrez
  • finalize.sh - Executed by update.sh once all invocations of fetch.sh have completed
  • resume.sh - Script to allow resuming execution of fetch.sh invocations in the event that any fail. This script is copied to the job directory
    by update.sh and is intended to be executed from there.
  • schema.sql - Database schema
  • temp_tables.sql - Temporary table schema used by fetch.sh
  • subclassOf.sh - Example utility to query database taxonomy data

Languages

Shell92.3%HCL7.7%

Contributors

MIT License
Created January 26, 2021
Updated July 10, 2022
brinkmanlab/MicrobeDB | GitHunt