GitHunt
PI

Comparing and combining multiple metagenomic datasets

Contributors :
Pierre PETERLONGO, pierre.peterlongo@inria.fr [24/07/14]
Nicolas MAILLET, nicolas.maillet@inria.fr [24/07/14]
Guillaume Collet, guillaume@gcollet.fr [24/07/14]

————————————————————————————————————————————————————————————————————————————————
At a glance:
Commet is used for de novo comparisons of raw read sets. It is the successor of
the compareads tool.
————————————————————————————————————————————————————————————————————————————————

This software is a computer program whose purpose is to find all the similar
reads between two set of NGS reads. It also provide a similarity score between
the two samples.

Copyright (C) 2014 INRIA

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU Affero General Public License as published by the Free
Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along
with this program. If not, see http://www.gnu.org/licenses/.

————————————————————————————————————————————————————————————————————————————————
REQUIREMENTS

  • C++ compiler (g++-4.9, clang-3.2…)
  • python 2.7
  • R

————————————————————————————————————————————————————————————————————————————————
INSTALLATION

Commet's applications have been written in C++.
To compile those programs, a makefile is provided using g++:

make

If you want to install in /usr/local/bin, run with root permissions (use sudo):

make
sudo make install

Then, you can test Commet on the ABCDE_bench data by running the following
command:

python Commet.py ABCDE_bench/sets_config.txt -k 32

————————————————————————————————————————————————————————————————————————————————
USER GUIDE

For complete descriptions of modules and usage, a user guide is provided in the
doc directory.

————————————————————————————————————————————————————————————————————————————————
SCRIPTS

Commet.py:
- input:
- A file containing a list of read sets (see below)
- output:
- a bit vector for each intersection of pairs of read sets
- 3 matrices in CVS format
- plain: line i, column j contains the number of reads
from set i similar to at least one read of set j
- percentage: line i, column j contains the percentage
of reads from set i similar to at least one read of
set j
- normalized: line i, column j contains the percentage
of reads similar between sets i and j with respect to
the total number of reads in sets i and j.
- 3 heatmaps (png format), computed from the three previously
mentioned matrices
- A dendrogram (png format) of input reads obtained from the
normalized matrix, by hierarchical complete clusterization of
all input reads.
- misc.:
- This script needs python 2.7
- This script can parallelize computations on SGE clusters using
the --sge option. In this case the analysis of the CVS
matrices generating the four png figures (three heatmaps and
one dendrogram) is performed once all jobs are finished using
the Commet_analysis.py script.

————————————————————————————————————————————————————————————————————————————————
LISTS OF READ SETS

Read sets may be composed of several files in different format (fasta, fastq,
gzip). To give these read sets to Commet, we use the following format:

- Each line corresponds to a read set.
- Each line begins with an identifier for the set directly followed by a
  colon (:).
- Then, each file of a set is separated by a semicolon (;).
- A bit vector may be associated to a file by adding its name after a
  comma (,).

Example:
Name1:path_set1.1.fq.gz;path_set_1.2.fq.gz,bv1.2.bv;path_set1.3.fq;...
Name2:path_set2.1.fq.gz;path_set_2.2.fq.gz
Name3:path_set3.1.fq,bv3.1.bv;path_set_3.2.fq,bv3.2.bv;path_set3.3.fq.gz
...
————————————————————————————————————————————————————————————————————————————————
APPLICATIONS SUMMARY (compiled in the 'bin' directory)

filter_reads
index_and_search
extract_reads
bvop

————————————————————————————————————————————————————————————————————————————————
FILTER_READS

  • input:

    • A file containing reads (nucleotide sequences) in fasta or fastq format.
      The input file may be compressed with gzip algorithm.
    • Four parameters for selection : - minimum size [default=0]
      - maximum number of N [default=any]
      - minimum Shannon index [default=0]
      - maximum selected reads [default=all].
  • output:

    • A file containing a vector of bits.
      Each bit of the vector corresponds to a read in the input file.
      If the bit is 1 then the read is selected, if 0 the read was filtered out.

————————————————————————————————————————————————————————————————————————————————
INDEX_AND_SEARCH

  • input:

    • A file read sets representing the reference (only the first set is used)
    • A file read sets representing the queries (-s) (all sets are searched)
  • output:

    • for each read file in the queries, a bit vector stores reads that share
      enough similarity with reads from the reference.

————————————————————————————————————————————————————————————————————————————————
BVOP

  • input:

    • A file containing a vector of bits.
    • A logical operation to perform (with another bit vector optionaly)
  • output:

    • A file containing the resulting bit vector.

————————————————————————————————————————————————————————————————————————————————
EXTRACT_READS

  • input:
    • A file containing reads (nucleotide sequences) in fasta or fastq format.
      The input file may be compressed with gzip algorithm.

    • A file containing a vector of bits of the same size.

    • output:

      • A file containing only reads that correspond to a 1 in the bit vector

Languages

C++77.5%Python17.9%R3.1%Makefile1.4%HTML0.0%

Contributors

Created July 8, 2014
Updated December 30, 2024
pierrepeterlongo/commet | GitHunt