crux82/MM-IGLU-Dialogues
Multimodal dialogue planning for interactive agents in Minecraft-like worlds.
Training Multi-Modal LLMs through Dialogue Planning for HRI
Introduction
In this repository, we share the training, testing, and evaluation code for Training Multi-Modal LLMs through Dialogue Planning for HRI accepted at ACL 2025, and we report only the method and results for MiniCPM since LLaVA was used just as a comparison for the first step.
The goal of this project is to develop a Multi-Modal Large Language Model (MLLM) capable of engaging in a dialogue within a Minecraft-like environment.
The interaction is based on an image of the world and user-provided commands. The user acts as the Architect, dictating tasks to be performed, while the MLLM acts as the Builder, gathering necessary information by observing the world or asking clarifying questions. The system will then confirm or reject the command.
Our experiments demonstrate that incorporating a structured planning phase during the dialogue improves the clarity, ease, and overall effectiveness of the interaction. The released systems are trained on the dataset to generate either relevant questions or confirmations about the executability of a command.
The repository contains two systems:
- Fine-tuned with Plan: This system works in two steps:
- Step 1: Given the dialogue history, the image, and the last user utterance, it generates a plan for gathering the required information.
- Step 2: Using the dialogue history, the image, the last user utterance, and the generated plan, the system produces a relevant question based on the first category of the plan.
- Fine-tuned no Plan: This system directly generates a question or a confirmation for the user without planning the dialogue.
Pre-requisites
- Download the base model: obtain the MiniCPM-V 2.6 base model from the Huggingface repository and store the model in your local disk
- Edit the base model: in the model main folder replace the modeling_minicpmv.py file with the customized one that you can find in this repo under the folder minicpm_edited_files.
Training
To train the models, follow these steps:
-
Set up the environment:
Create a virtual environment and install the required packages using the following commands:# Create new environment conda create --name minicpm_env python=3.10 -y #activate environment conda activate minicpm_env #Instal packages pip install -r requirements_sft.txt -
Run the training scripts:
Use the appropriate bash script to train the desired model (split for both English and Italian). Run in the finetuning/finetune folder:- Fine-tuned with Plan (EN):
sh finetune_lora_ENG_history_plan.sh - Fine-tuned no Plan (EN):
sh finetune_lora_ENG_history_noplan.sh - Fine-tuned with Plan (IT):
sh finetune_lora_ITA_history_plan.sh - Fine-tuned no Plan (IT):
sh finetune_lora_ITA_history_noplan.sh
Arguments:
- MODEL: path to the base MiniCMP-V 2.6 model
- DATA: path to the train data folder
- EVAL_DATA: path to the eval data folder
- LLM_TYPE: defines how to process the data and construct the appropriate conversation input to train the model. It is set to the custom implementation 'qwen2_mmiglu'.
- tune_vision: boolean. Enable the SigClip VIsion Encoder weights adjustment
- output_dir: define the directory to save LoRA weights
- Fine-tuned with Plan (EN):
Testing
To evaluate the models, generate answers for the test set. The evaluation metrics will be computed in the next step.
In the evaluation folder:
-
Run the testing scripts:
Use the following bash scripts:- Fine-tuned with Plan:
sh evaluation_plan_model.sh - Fine-tuned no Plan:
sh evaluation_noplan_model.sh - 0-Shot (a simple baseline):
sh evaluation_ZS_model.sh
Arguments:
- LANGUAGE:the language configuration for evaluation. Allowed parameters are:'en', 'it', 'en2it', 'it2en'.
'en' and 'it' evaluate the model trained and tested in the same language.
'en2it' and 'it2en' are used for cross-lingual evaluation, where the model is trained in one language but tested in another. - NAME: name for the output .tsv file(s), which is(are) used in the next step
- MODEL_TYPE: path to the base model
- PATH_TO_ADAPTER: path to the LoRA weights
- Fine-tuned with Plan:
Evaluation
-
Set up the environment:
Create a virtual environment and install the required packages using the following commands:# Create new environment conda create --name eval_env python=3.10 -y #activate environment conda activate eval_env #Instal packages pip install -r requirements_eval.txt
To produce the evaluation metrics, run the provided scripts, in the evaluation folder:
-
Evaluation scripts:
name_of_tsv_file.tsvshould be the same as the output of the previous step- Compute Plan metrics:
python calculate_plan_metrics.py name_of_tsv_file.tsv- Compute metrics about Clarifications Request quality:
python question_accuracy.py name_of_tsv_file.tsv- Compute metrics about Recap quality:
python calculate_conf_recap_metrics.py name_of_tsv_file.tsv
DATA
Introduction
In this repository, we share also the dataset related to Training Multi-Modal LLMs through Dialogue Planning for HRI.
The dataset focuses on Dialogical Interaction in a Minecraft-like environment and is bilingual, i.e., it contains dialogues in both Italian and English. The data is divided into train/eval/test sets with a ratio of 80/10/10.
Dataset Structure
The dataset is structured as follows:
- data
|--images
|--it
|--plan
|--train.json
|--eval.json
|--test.json
|--noplan
|--train.json
|--eval.json
|--test.json
|--en
|--plan
|--train.json
|--eval.json
|--test.json
|--noplan
|--train.json
|--eval.json
|--test.json
Example Dialogue
Below is an example of a short dialogue extracted from the dataset: a dialogue between a user and an assistant interacting with a block environment. The user provides commands on how to arrange the blocks, while the assistant interprets the request and generates a response based on its understanding of the scene.
"id": "CQ-game-2438_068816_noplan",
"image": "../../images/CQ-game-2438.png",
"lang": "en",
"image_id": "CQ-game-2438",
"conversations": [
{
"role": "user",
"content": "Add blocks above the easternmost purple column"
},
{
"role": "assistant",
"content": "How many blocks do I need to add above the easternmost purple column?"
},
{
"role": "user",
"content": "Three"
},
{
"role": "assistant",
"content": "What color should the three blocks be?"
},
{
"role": "user",
"content": "Yellow"
},
{
"role": "assistant",
"content": "How should I arrange the three yellow blocks?"
},
{
"role": "user",
"content": "Vertically"
},
{
"role": "assistant",
"content": "Ok, I'll add three yellow blocks arranged vertically above the easternmost purple column"
}
Below is an example of a short dialogue extracted from the dataset: a dialogue between a user and assistant interacting with a block environment. The user provides a command to add blocks, and the assistant responds by generating a structured plan to clarify ambiguities before executing the request. The conversation follows a step-by-step process where the assistant asks for missing details (such as the number, color, and arrangement of the blocks) before confirming the final action.
Example Dialogue Plan Augmented
"id": "CQ-game-2438_068816_plan",
"image": "../../images/CQ-game-2438.png",
"lang": "en",
"image_id": "CQ-game-2438",
"conversations": [
{
"role": "user",
"content": "Add blocks above the easternmost purple column"
},
{
"role": "plan",
"content": "['NUMBER', 'COLOR', 'DISPOSITION', 'CONFIRMATION WITH RECAP']"
},
{
"role": "assistant",
"content": "How many blocks do I need to add above the easternmost purple column?"
},
{
"role": "user",
"content": "Three"
},
{
"role": "plan",
"content": "['COLOR', 'DISPOSITION', 'CONFIRMATION WITH RECAP']"
},
{
"role": "assistant",
"content": "What color should the three blocks be?"
},
{
"role": "user",
"content": "Yellow"
},
{
"role": "plan",
"content": "['DISPOSITION', 'CONFIRMATION WITH RECAP']"
},
{
"role": "assistant",
"content": "How should I arrange the three yellow blocks?"
},
{
"role": "user",
"content": "Vertically"
},
{
"role": "plan",
"content": "['CONFIRMATION WITH RECAP']"
},
{
"role": "assistant",
"content": "Ok, I'll add three yellow blocks arranged vertically above the easternmost purple column"
}
Images
Due to space constraints, only a subset of the images is included in this repository to give an overview of the dataset.
Example Images
Below are three sample images from the dataset:
Citation
@inproceedings{hromei-etal-2025-training,
title = "Training Multi-Modal {LLM}s through Dialogue Planning for {HRI}",
author = "Hromei, Claudiu Daniel and
Borazio, Federico and
Sensi, Andrea and
Passone, Elisa and
Croce, Danilo and
Basili, Roberto",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.837/",
doi = "10.18653/v1/2025.findings-acl.837",
pages = "16266--16284",
ISBN = "979-8-89176-256-5",
}



