Overview

MedInjection-FR is a large-scale French biomedical instruction dataset designed to study how the provenance of supervision (native, synthetic, translated) affects instruction-tuning of LLMs. The corpus supports multiple-choice QA (single and multi-answer) and open-ended QA, and is released together with a family of fine-tuned baseline models.

Composition & Tasks

Task types

  • MCQU (single-answer)
  • MCQ (multiple-answer)
  • OEQ (open-ended QA)

Counts (all components): OEQ 57,509, MCQ 59,592, MCQU 454,335.
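As a quick sanity check on these figures, the minimal sketch below sums the three task counts and prints each task's share of the corpus. The variable names are illustrative and are not part of any released tooling.

```python
# Task counts copied from the "Counts" line above; names are illustrative only.
task_counts = {"OEQ": 57_509, "MCQ": 59_592, "MCQU": 454_335}

total = sum(task_counts.values())
shares = {task: count / total for task, count in task_counts.items()}

print(f"total: {total:,}")
for task, share in shares.items():
    print(f"{task}: {share:.1%}")
```

The sum is 571,436, which matches the overall total reported in the Splits table, and single-answer MCQU items account for roughly four fifths of the corpus.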

Splits

Component     Train     Validation   Test     Total
Native        57,563    5,055        14,629   77,247
Synthetic     76,506    -            -        76,506
Translated    366,370   38,011       13,293   417,674
Total         500,439   43,066       27,931   571,436

Translation quality (WMT24 biomedical parallel)

Model                BLEU    COMET
GPT-4o-mini          51.01   0.8751
Gemini 2.0 Flash     53.72   0.8783
WMT’24 best (ref.)   53.54   0.8760

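Higher is better for both metrics. Gemini 2.0 Flash slightly exceeds the best WMT’24 reference system on BLEU and COMET, and GPT-4o-mini stays within about 2.5 BLEU points and 0.001 COMET of it, which indicates strong translation fidelity for the translated subset.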
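To make the comparison with the reference system explicit, the short sketch below computes each model's gap to the WMT’24 best scores. The numbers are copied from the table above; the dictionary layout is only an illustration.

```python
# Scores copied from the translation-quality table; the layout is illustrative.
scores = {
    "GPT-4o-mini": {"BLEU": 51.01, "COMET": 0.8751},
    "Gemini 2.0 Flash": {"BLEU": 53.72, "COMET": 0.8783},
}
reference = {"BLEU": 53.54, "COMET": 0.8760}  # WMT'24 best (ref.)

# Difference to the reference for every model and metric (positive = better).
deltas = {
    model: {metric: round(value - reference[metric], 4) for metric, value in row.items()}
    for model, row in scores.items()
}

for model, row in deltas.items():
    print(f"{model}: BLEU {row['BLEU']:+.2f}, COMET {row['COMET']:+.4f}")
```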

Download

Each component is published separately. Use the links below or load via the 🤗 Datasets library.

Python (🤗 Datasets)


from datasets import load_dataset

ds = load_dataset("MedInjection-FR/Native")  # or "Synthetic", "Translated"
print(ds)
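Since the three components live in separate repositories, it can be convenient to load them in one pass and compare their split sizes. The sketch below is one possible way to do this. `load_components` and `split_sizes` are hypothetical helpers, not part of the release, and the loader is passed in as an argument so the logic can be exercised without network access.

```python
from typing import Callable, Dict

# Component names as listed in the loading snippet above.
COMPONENTS = ("Native", "Synthetic", "Translated")


def load_components(loader: Callable[[str], object],
                    namespace: str = "MedInjection-FR") -> Dict[str, object]:
    """Load every component with the given loader, keyed by component name."""
    return {name: loader(f"{namespace}/{name}") for name in COMPONENTS}


def split_sizes(dataset_dict) -> Dict[str, int]:
    """Return the number of rows per split for a DatasetDict-like mapping."""
    return {split: len(rows) for split, rows in dataset_dict.items()}


# Typical use (requires network access and the `datasets` package):
#   from datasets import load_dataset
#   corpora = load_components(load_dataset)
#   for name, ds in corpora.items():
#       print(name, split_sizes(ds))
```

The split names are read from whatever each repository exposes, so the helper does not assume that every component ships validation or test data.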
        

Fine-tuned Models

We release seven instruction-tuned baselines (Qwen-4B-Instruct backbone, DoRA adapters), each trained on 30k samples per configuration.

Quick inference (🤗 Transformers)


from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MedInjection-FR/QWEN-4B-NAT-TRAD"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = """Un professionnel de santé de 54 ans consulte un spécialiste des maladies infectieuses pour un suivi concernant un diagnostic récent d'hépatite C chronique. 
          Il s'est initialement présenté avec des symptômes tels que fatigue, malaise et enzymes hépatiques élevées et soupçonne d'avoir contracté l'infection à la suite
          d'une piqûre d'aiguille il y a des années. Malgré le début du traitement, son titre viral reste élevé, ce qui incite le médecin à ajouter un nouveau médicament
          qui inhibe la maturation virale en bloquant la synthèse des protéines. Quel est l'effet indésirable le plus probable de ce médicament ?
          Choix de réponses : 
          (A) Uropathie cristalline obstructive 
          (B) Suppression de la moelle osseuse 
          (C) Insomnie et irritabilité 
          (D) Céphalées et photosensibilité 
          (E) Rêves lucides 
          (F) Hyperbilirubinémie 
          (G) Pancréatite 
          (H) Neuropathie périphérique 
          (I) Augmentation de la créatine kinase 
          (J) Alopécie"""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

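# generate the answer; max_new_tokens=1 returns only the first generated token,
# which is typically enough for a single-letter answer (raise it for
# multi-answer or open-ended questions)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1
)
# keep only the newly generated tokens by dropping the prompt tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)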
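For batch evaluation it may help to build prompts programmatically and to normalise the generated answer. The sketch below mirrors the layout of the example prompt above (question, "Choix de réponses :", then lettered options). `build_mcq_prompt` and `extract_letter` are hypothetical helpers, and whether the training data follows exactly this template is an assumption.

```python
import string
from typing import Optional, Sequence


def build_mcq_prompt(question: str, options: Sequence[str]) -> str:
    """Format a question and its options as '(A) ...' lines, one per option."""
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("at most 26 options are supported")
    lines = [question.strip(), "Choix de réponses :"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option.strip()}")
    return "\n".join(lines)


def extract_letter(generated: str, n_options: int) -> Optional[str]:
    """Return the answer letter if the output starts with a valid one, else None."""
    valid = string.ascii_uppercase[:n_options]
    text = generated.strip().lstrip("(").upper()
    # Accept a lone letter or a letter followed by a non-letter, e.g. "B" or "B)".
    if text and text[0] in valid and (len(text) == 1 or not text[1].isalpha()):
        return text[0]
    return None


prompt = build_mcq_prompt("Quel est le traitement de première intention ?",
                          ["Option un", "Option deux"])
print(prompt)
print(extract_letter(" b", n_options=2))
```

Returning `None` for anything that does not start with a valid letter keeps malformed generations visible instead of silently mapping them to an option.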

        

Evaluation at a glance

Ethics, Intended Use & License

This dataset and the released models are for research use only. They are not a substitute for professional medical advice, diagnosis, or treatment.

Citation

If you use MedInjection-FR or the models, please cite:



        

Contact

Questions, feedback, or requests: open an issue on the repo or email you@example.com.