English | 中文
Welcome to our Awesome-llm-safety repository! 🥰🥰🥰
🧑💻 Our Work
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). But we don't stop there; included are also relevant talks, tutorials, conferences, news, and articles. Our repository is constantly updated to ensure you have the most current information at your fingertips.
If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository will be listed under each subcategory to which it pertains🤩!.
✔️ Perfect for Majority
- For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. Classic or influential papers retained in the README provide a beginner-friendly navigation through interesting directions in the field;
- For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.
🧭 How to Use this Guide
- Quick Start: In the README, users can find a curated list of select information sorted by date, along with links to various consultations.
- In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.
Let’s start LLM Safety tutorial!
- 🛡️Awesome LLM-Safety🛡️
Date | Institute | Publication | Paper |
---|---|---|---|
20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots |
22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback |
23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? |
23.12 | OpenAI | Open AI | Practices for Governing Agentic AI Systems |
Date | Type | Title | URL |
---|---|---|---|
22.02 | Toxicity Detection API | Perspective API | link paper |
23.07 | Repository | Awesome LLM Security | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Security Paper
Date | Institute | Publication | Paper |
---|---|---|---|
19.12 | Microsoft | CCS2020 | Analyzing Information Leakage of Updates to Natural Language Models |
21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
21.10 | Stanford | ICLR2022 | Large language models can be strong differentially private learners |
22.02 | Google Research | ICLR2023 | Quantifying Memorization Across Neural Language Models |
22.02 | UNC Chapel Hill | ICML2022 | Deduplicating Training Data Mitigates Privacy Risks in Language Models |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Privacy Paper
Date | Institute | Publication | Paper |
---|---|---|---|
21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
23.11 | Harbin Institute of Technology | arxiv | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
23.11 | Arizona State University | arxiv | Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey |
Date | Type | Title | URL |
---|---|---|---|
23.07 | Repository | llm-hallucination-survey | link |
23.10 | Repository | LLM-Factuality-Survey | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Truthfulness&Misinformation Paper
Date | Institute | Publication | Paper |
---|---|---|---|
20.12 | USENIX Security 2021 | Extracting Training Data from Large Language Models | |
22.11 | AE Studio | NIPS2022(ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models |
23.06 | arxiv | Are aligned neural networks adversarially aligned? | |
23.07 | CMU | arxiv | Universal and Transferable Adversarial Attacks on Aligned Language Models |
23.10 | University of Pennsylvania | arxiv | Jailbreaking Black Box Large Language Models in Twenty Queries |
Date | Type | Title | URL |
---|---|---|---|
23.01 | Community | Reddit/ChatGPTJailbrek | link |
23.02 | Resource&Tutorials | Jailbreak Chat | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
23.10 | Article | Adversarial Attacks on LLMs(Author: Lilian Weng) | link |
23.11 | Video | [1hr Talk] Intro to Large Language Models From 45:45(Author: Andrej Karpathy) |
link |
👉Latest&Comprehensive JailBreak & Attacks Paper
Date | Institute | Publication | Paper |
---|---|---|---|
21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Defenses Paper
Date | Institute | Publication | Paper |
---|---|---|---|
20.09 | University of Washington | EMNLP2020(findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models |
21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated datasets for Adversarial and Implicit Hate Speech Detection |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
- Toxicity - RealToxicityPrompts datasets
- Truthfulness - TruthfulQA datasets
👉Latest&Comprehensive datasets & Benchmark Paper
In this section, we list some of the scholars we consider to be experts in the field of LLM Safety!
Scholars | HomePage&Google Scholars | Keywords or Interested |
---|---|---|
Nicholas Carlini | Homepage | Google Scholar | the intersection of machine learning and computer security&neural networks from an adversarial perspective |
Daphne Ippolito | Google Scholar | Natural Language Processing |
Chiyuan Zhang | Homepage | Google Scholar | Especially interested in understanding the generalization and memorization in machine and human learning, as well as implications in related areas like privacy |
Katherine Lee | Google Scholar | natural language processing&translation&machine learning&computational neuroscienceattention |
Florian Tramèr | Homepage | Google Scholar | Computer Security&Machine Learning&Cryptography&the worst-case behavior of Deep Learning systems from an adversarial perspective, to understand and mitigate long-term threats to the safety and privacy of users |
Jindong Wang | Homepage | Google Scholar | Large Language Models (LLMs) evaluation and robustness enhancement |
Chaowei Xiao | Homepage | Google Scholar | interested in exploring the trustworthy problem in (MultiModal) Large Language Models and studying the role of LLMs in different application domains. |
Andy Zou | Homepage | Google Scholar | ML Safety&AI Safety |
🤗If you have any questions, please contact our authors!🤗
✉️: ydyjya ➡️ [email protected]
💬: LLM Safety Discussion