MedInjection-FR • French Biomedical Instruction Dataset

Overview

MedInjection-FR is a large-scale French biomedical instruction dataset designed to study how the provenance of supervision (native, synthetic, translated) affects instruction-tuning of LLMs. The corpus supports multiple-choice QA (single and multi-answer) and open-ended QA, and is released together with a family of fine-tuned baseline models.

Native: 77,247
Synthetic: 76,506
Translated: 417,674
Total: 571,436

Composition & Tasks

Task types

MCQU (single-answer)
MCQ (multiple-answer)
OEQ (open-ended QA)

Counts (all components): OEQ 57,509, MCQ 59,592, MCQU 454,335.

Splits

Component	Train	Validation	Test	Total
Native	57,563	5,055	14,629	77,247
Synthetic	76,506	—	—	76,506
Translated	366,370	38,011	13,293	417,674
Total	500,439	43,066	27,931	571,436

Translation quality (WMT24 biomedical parallel data)

Model	BLEU	COMET
GPT-4o-mini	51.01	0.8751
Gemini 2.0 Flash	53.72	0.8783
WMT24 best (ref.)	53.54	0.8760

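Higher is better for both metrics. Both translation models score close to the best WMT24 system, and Gemini 2.0 Flash slightly exceeds it on BLEU and COMET, which suggests strong translation fidelity for the translated subset.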

Download

Each component is published separately. Use the links below or load via the 🤗 Datasets library.

Native

French medical exams, educational resources, and curated QA.

MedInjection-FR/Native

Synthetic

Generated with GPT-4o from clinical cases and abstracts.

MedInjection-FR/Synthetic

Translated

French translations of established English biomedical datasets.

MedInjection-FR/Translated

Python (🤗 Datasets)


from datasets import load_dataset

# other components: "MedInjection-FR/Synthetic", "MedInjection-FR/Translated"
ds = load_dataset("MedInjection-FR/Native")
print(ds)  # lists the available splits and their sizes
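
To inspect all three components in one pass, a small helper along these lines may be convenient. This is only a sketch: the repository IDs are taken from the names listed above, while `repo_id`, `load_components`, and `print_split_sizes` are hypothetical helper names, and the split names are whatever the hosted datasets actually expose.

```python
from typing import Dict, Iterable

# Component names as listed in this README.
COMPONENTS = ("Native", "Synthetic", "Translated")


def repo_id(component: str, org: str = "MedInjection-FR") -> str:
    """Build the Hub repository ID for one component (hypothetical helper)."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    return f"{org}/{component}"


def load_components(components: Iterable[str] = COMPONENTS) -> Dict[str, object]:
    """Load each requested component; needs network access and `datasets`."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return {name: load_dataset(repo_id(name)) for name in components}


def print_split_sizes() -> None:
    """Print the number of rows in every split of every component."""
    for name, dataset_dict in load_components().items():
        for split, data in dataset_dict.items():
            print(f"{name:<11} {split:<11} {data.num_rows}")
```

Calling `print_split_sizes()` downloads the data, so the printed row counts can be compared with the Splits table above.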

Fine-tuned Models

We release seven instruction-tuned baselines (Qwen-4B-Instruct backbone, DoRA adapters), trained on 30k samples per configuration:

QWEN-4B-NAT
QWEN-4B-TRAD
QWEN-4B-SYN
QWEN-4B-NAT-TRAD
QWEN-4B-NAT-SYN
QWEN-4B-TRAD-SYN
QWEN-4B-ALL
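
The seven names above are the non-empty subsets of the three sources (NAT, TRAD, SYN), with the full set labelled ALL. A minimal sketch that enumerates them; `configuration_names` is a hypothetical helper, and nothing here reflects how the 30k training samples are divided among sources within a mixed configuration.

```python
from itertools import combinations
from typing import List, Sequence

# Source tags as used in the model names above.
SOURCES = ("NAT", "TRAD", "SYN")


def configuration_names(sources: Sequence[str] = SOURCES,
                        prefix: str = "QWEN-4B") -> List[str]:
    """Enumerate every non-empty subset of sources as a model name."""
    names = []
    for size in range(1, len(sources) + 1):
        for combo in combinations(sources, size):
            # The subset containing every source is named ALL.
            suffix = "ALL" if size == len(sources) else "-".join(combo)
            names.append(f"{prefix}-{suffix}")
    return names


for name in configuration_names():
    print(name)
```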

NAT

Best single-source configuration on MCQ and MCQU.

NAT-TRAD

Top mixed configuration.

ALL

All sources combined.

Quick inference (🤗 Transformers)


from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MedInjection-FR/QWEN-4B-NAT-TRAD"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
# continuation lines are left unindented so no stray whitespace enters the prompt
prompt = """Un professionnel de santé de 54 ans consulte un spécialiste des maladies infectieuses pour un suivi concernant un diagnostic récent d'hépatite C chronique.
Il s'est initialement présenté avec des symptômes tels que fatigue, malaise et enzymes hépatiques élevées et soupçonne d'avoir contracté l'infection à la suite
d'une piqûre d'aiguille il y a des années. Malgré le début du traitement, son titre viral reste élevé, ce qui incite le médecin à ajouter un nouveau médicament
qui inhibe la maturation virale en bloquant la synthèse des protéines. Quel est l'effet indésirable le plus probable de ce médicament ?
Choix de réponses :
(A) Uropathie cristalline obstructive
(B) Suppression de la moelle osseuse
(C) Insomnie et irritabilité
(D) Céphalées et photosensibilité
(E) Rêves lucides
(F) Hyperbilirubinémie
(G) Pancréatite
(H) Neuropathie périphérique
(I) Augmentation de la créatine kinase
(J) Alopécie"""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

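# generate the answer; max_new_tokens=1 returns a single token (presumably the
# answer letter), so raise it for multi-answer or open-ended questions
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1
)
# keep only the newly generated tokens by dropping the prompt tokens
prompt_length = len(model_inputs.input_ids[0])
output_ids = generated_ids[0][prompt_length:].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)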
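
For other questions, the prompt layout of the example above can be reproduced programmatically, and the generated text reduced to answer letters. The following is a minimal sketch: `build_mcq_prompt` and `parse_answer_letters` are hypothetical helpers, the layout simply mirrors the example prompt (it is an assumption that this matches the training template), and the sample question is purely illustrative.

```python
import re
import string
from typing import List, Sequence


def build_mcq_prompt(question: str, choices: Sequence[str]) -> str:
    """Format a question and its options as '(A) ...' lines, one per option."""
    if len(choices) > len(string.ascii_uppercase):
        raise ValueError("at most 26 choices are supported")
    lines = [question.strip(), "Choix de réponses :"]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"({letter}) {choice.strip()}")
    return "\n".join(lines)


def parse_answer_letters(output: str, n_choices: int) -> List[str]:
    """Return the sorted, de-duplicated option letters found in the output."""
    valid = set(string.ascii_uppercase[:n_choices])
    # Match capital letters that stand alone, e.g. "B" or "(A), (C)",
    # but not the first letter of an ordinary word such as "Bonjour".
    candidates = re.findall(r"(?<![A-Za-z])[A-Z](?![A-Za-z])", output)
    return sorted({c for c in candidates if c in valid})


# Illustrative question, not taken from the dataset.
prompt = build_mcq_prompt(
    "Quelle vitamine est synthétisée par la peau sous l'effet du soleil ?",
    ["Vitamine A", "Vitamine C", "Vitamine D"],
)
print(prompt)
print(parse_answer_letters(" C", n_choices=3))
```

The parser is deliberately strict: letters outside the offered range are ignored, so an empty list signals an output that needs manual inspection.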

Evaluation at a glance

MCQ and MCQU are reported with Exact Match; the multi-answer MCQ task is also reported with the Hamming score.
OEQ uses BLEU/ROUGE/METEOR/BERTScore and an LLM-as-a-judge calibrated on human annotations (100 samples).
Mixed training (especially NAT-TRAD) provides complementary gains over single-source setups.
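
As an illustration of the two multiple-choice metrics, the sketch below scores predicted answer sets against gold sets. The exact formulas used in the evaluation are not spelled out here, so this assumes the set-overlap definition of the Hamming score that is common in multi-label evaluation (size of the intersection divided by size of the union); the function names are hypothetical.

```python
from typing import Iterable


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """1.0 when the predicted answer set equals the gold set, else 0.0."""
    return float(set(predicted) == set(gold))


def hamming_score(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Overlap between predicted and gold sets: |P & G| / |P | G| (assumed definition)."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: treated as perfect agreement
    return len(p & g) / len(p | g)


# A partially correct multi-answer prediction gets no exact-match credit
# but a non-zero Hamming score.
pred, gold = {"A", "C"}, {"A", "B", "C"}
print(exact_match(pred, gold))
print(round(hamming_score(pred, gold), 3))
```

For single-answer questions the two metrics coincide under this definition only when the prediction is a single letter, which is one reason partial-credit scoring matters mainly for multi-answer items.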

Ethics, Intended Use & License

This dataset and the released models are for research use only. They are not a substitute for professional medical advice, diagnosis, or treatment.

No protected health information (PHI) is included; sources are compiled from public datasets and teaching material.
Evaluation includes human expert checks for a small sample; outputs may still contain errors.
Please review the LICENSE before use. If unsure, contact the maintainers.

Citation

If you use MedInjection-FR or the models, please cite:

Contact

Questions, feedback, or requests: open an issue on the repo or email you@example.com.