An Experimental Study on Harassment Moderation in Llama and Alpaca

  • Additional Information
    • Publisher Information:
      Multidisciplinary Digital Publishing Institute
    • Subject:
      2026
    • Collection:
      MDPI Open Access Publishing
    • Abstract:
      The growing integration of chatbots and large language models (LLMs) into society raises important concerns about their potential to reproduce toxic human behaviors. As a result, it is essential to investigate these models to mitigate or eliminate such risks. This paper presents an experimental study evaluating the responses of the Llama and Alpaca models to scenarios involving verbal harassment. The methodology involved using harassment dialogues generated by an LLM as prompts to elicit responses from both models. The responses were then analyzed for levels of toxicity, sexually explicit content, and flirtatiousness. The results indicate that although both models reduce explicit offensive terms, they exhibit limitations in identifying and intercepting abusive behavior from users. Statistical analysis reveals that general-purpose instruction tuning in Alpaca does not provide a robust safety barrier compared to the Llama base model for most variables investigated in the experiment. However, a significant difference was observed concerning flirting, where Llama proved more prone to validation and encouragement than Alpaca. Furthermore, the study identifies critical vulnerabilities, such as a “self-deprecation” bias in Llama and “mirroring” behavior in Alpaca. We also report a complementary triangulation with GPT-family models as a secondary point of reference. This paper discusses and contains content that can be offensive or upsetting.
    • File Description:
      application/pdf
    • Relation:
      https://dx.doi.org/10.3390/bdcc10040100
    • Identifier:
      10.3390/bdcc10040100
    • Online Access:
      https://doi.org/10.3390/bdcc10040100
    • Rights:
      https://creativecommons.org/licenses/by/4.0/
    • Identifier:
      edsbas.1BB44FC9