
Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach

arXiv · Python · CC BY 4.0

LAM Architecture

Installation

See installation instructions.

Datasets

| Dataset | Format | Type | URL |
| --- | --- | --- | --- |
| EigenScape | em32 | real | Link |
| STARSS23 | mic & em32 | real | Link |
| LOCATA | em32 | real | Link |
| SpatialScaper Simulated Audio | mic & em32 | synthetic | Link |

Generate dataset

See more details on how to generate the HDF dataset.

Training

Use train.py to train the model.

  • -h: display help information
  • -C, --config: specify the configuration file required for training
  • -R, --resume: resume training from the last saved model checkpoint

Please refer to config/train/README to understand how to set up your training config.

Example:

```shell
# The configuration file used to train the model is "config/train/train.json"
python train.py -C config/train/train.json

# Continue training from the last saved model checkpoint
python train.py -C config/train/train.json -R
```
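The training flags above correspond to a standard argparse interface. The following is a hypothetical sketch of how train.py likely declares them; the actual script may differ:

```python
import argparse

# Hypothetical sketch of the CLI implied by the flags above;
# the real train.py may declare its arguments differently.
parser = argparse.ArgumentParser(description="Train the LAM model")
parser.add_argument("-C", "--config", required=True,
                    help="configuration file required for training (*.json)")
parser.add_argument("-R", "--resume", action="store_true",
                    help="resume training from the last saved checkpoint")

# Simulate: python train.py -C config/train/train.json -R
args = parser.parse_args(["-C", "config/train/train.json", "-R"])
```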

Inference

Use infer.py to run inference with a pre-trained model.

  • -h: display help information
  • -D, --device: GPU index to use (default: 0 for a single GPU)
  • -C, --config: configuration for k-means inference (*.json)

Please refer to config/inference/README to understand how to set up your inference config.

```shell
python infer.py -C /path/to/config/inference.json -D 0
```

Example:

```shell
python infer.py -C config/inference/inference.json -D 0
```

DoA Metrics from Inferred K-means Output

```shell
python doa_metrics.py -C /path/to/config/inference.json
```
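DoA accuracy is commonly summarized as the angular error between predicted and reference directions. A minimal sketch of that distance follows; it is illustrative only, since the exact metrics doa_metrics.py reports are defined by the repository:

```python
import math

def angular_error_deg(u, v):
    """Great-circle angle in degrees between two unit direction vectors.

    Illustrative only: the metrics doa_metrics.py actually computes
    are defined by the repo, not by this sketch.
    """
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # clamp against floating-point drift
    return math.degrees(math.acos(dot))
```

For example, two orthogonal directions are 90 degrees apart, and identical directions give zero error.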

Sound Event Localization using LAM

Use LAM's spherical acoustic maps (SAMs) as features for a SELD network (DCASE-style). Please refer to the seld directory, where you can perform batch feature extraction of SAMs and then train a network to perform DoA estimation on datasets like STARSS23 or LOCATA.

Visualization

Training Curves (TensorBoard)

```shell
# Run TensorBoard pointing to your directory of logs generated during training
tensorboard --logdir train

# Use --port to specify the port of the TensorBoard server
tensorboard --logdir train --port <port> --bind_all
```

Acoustic Map Visualization

Use infer_visualize.py to run inference and save spherical acoustic maps (SAMs) as PNG images.
One image is produced per time frame (default: 10 ms) and written to the directory specified by output_dir in the config.

Arguments

| Flag | Short | Description |
| --- | --- | --- |
| --config | -C | Path to inference config JSON (same schema as infer.py) |
| --device | -D | GPU index (default: 0). Pass cpu to run on CPU. |
| --per-band | -B | Save one map per frequency band instead of a single combined RGB image. |

Combined RGB mode (default) — all frequency bands are collapsed into a single RGB image via to_RGB() and one PNG per frame is saved:

```shell
python infer_visualize.py -C config/inference/infer_kitchensink_eval_locata.json -D 0
```

Output layout:

```
<output_dir>/
└── <clip_name>/
    ├── frame_0000_000000ms.png
    ├── frame_0001_000010ms.png
    └── ...
```
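One simple way to collapse many frequency bands into a single RGB image is to average three contiguous band groups into the R, G, and B channels. The sketch below is purely illustrative; the repo's to_RGB() may use a different mapping:

```python
def bands_to_rgb(band_maps):
    """Collapse B per-band maps (each H x W) into one H x W RGB map
    by averaging three contiguous band groups (assumes B >= 3).

    Purely illustrative: the repo's to_RGB() may differ.
    """
    B = len(band_maps)
    edges = [0, B // 3, 2 * B // 3, B]
    H, W = len(band_maps[0]), len(band_maps[0][0])
    rgb = [[[0.0] * 3 for _ in range(W)] for _ in range(H)]
    for c in range(3):  # low/mid/high band groups -> R/G/B
        group = band_maps[edges[c]:edges[c + 1]]
        for y in range(H):
            for x in range(W):
                rgb[y][x][c] = sum(m[y][x] for m in group) / len(group)
    return rgb
```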

Per-band mode (--per-band / -B) — one greyscale map per frequency band per frame:

```shell
python infer_visualize.py -C config/inference/infer_kitchensink_eval_locata.json -D 0 --per-band
```

Output layout:

```
<output_dir>/
└── <clip_name>/
    └── bands/
        ├── band00/
        │   ├── frame_0000_000000ms_band00.png
        │   └── ...
        ├── band01/
        └── ...
```
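The filenames in both layouts appear to encode the frame index and its start time in milliseconds (and the band index in per-band mode). Assuming that pattern holds, the naming can be sketched as:

```python
def frame_filename(i, t_sti_ms=10, band=None):
    """Build a frame PNG name matching the layouts above.

    Assumed pattern inferred from the example listings; the script's
    actual naming code may differ.
    """
    stem = f"frame_{i:04d}_{i * t_sti_ms:06d}ms"
    return f"{stem}.png" if band is None else f"{stem}_band{band:02d}.png"
```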

Config keys (optional, can also be set via CLI flags):

| Key | Default | Description |
| --- | --- | --- |
| "per_band" | false | Enable per-band mode (equivalent to --per-band) |
| "T_sti_ms" | 10 | Frame duration in ms; must match T_sti used in get_visibility_matrix |

Example config for the pre-trained LAM model:

```json
{
    "model": {
        "module": "model.LAM",
        "main": "LAM",
        "args": {}
    },
    "dataset": {
        "module": "dataset.inference_dataloader",
        "main": "InferenceDataset",
        "args": {
            "dataset": "/path/to/audio/files"
        }
    },
    "model_path": "checkpoints/LAM.pth",
    "output_dir": "output_visualize_LAM",
    "FS": 24000,
    "n_max": 3
}
```
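As a quick sanity check on these values: at FS = 24000 with the default 10 ms frames, each SAM frame covers 240 audio samples, and the visualizer emits 100 frames per second of audio. The arithmetic:

```python
# Derived from the config values above (FS = 24000) and the
# default frame duration (T_sti_ms = 10).
FS = 24000      # sample rate in Hz
T_STI_MS = 10   # frame duration in ms
samples_per_frame = FS * T_STI_MS // 1000
frames_per_second = 1000 // T_STI_MS
```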

Pre-trained Models

| Model | Input | Checkpoint |
| --- | --- | --- |
| UpLAM | 4-channel | UpLAM.pth |
| LAM | 32-channel | LAM.pth |

Citation

If you find our work useful, please cite our paper:

```bibtex
@article{roman2025latent,
  title={Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach},
  author={Roman, Adrian S. and Roman, Iran R. and Bello, Juan P.},
  journal={IEEE Workshop on Applications of Signal Processing to Audio and Acoustics},
  year={2025}
}
```