SCORELAB, Spain, and Mondragon University, Spain
Abstract: Large Language Models (LLMs) are becoming an integral part of our daily lives. But what if they provide dangerous advice, like instructions for poisoning a neighbor? Or what if they make wrong assumptions that influence real-world decisions, such as recommending men for leadership roles while relegating women to supportive positions? At first glance, LLMs often appear polite and helpful… but can we uncover their hidden "evilness"? In this tutorial, we will explore practical techniques and tools for testing what we refer to as the "evilness" of LLMs. Specifically, we will focus on two critical aspects: safety and bias. We will start by introducing the key concepts behind these issues, explaining why they matter and how they manifest in LLM behavior. Then, through hands-on exercises, we will demonstrate how to systematically test LLM safety using our tool ASTRAL, followed by an interactive session on detecting and analyzing bias with our tool suite Meta-Fair. By the end of the tutorial, participants will be equipped with practical skills and tools to automatically test and evaluate the "evilness" of LLMs.
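For a flavor of what such bias testing can look like in practice, below is a minimal, hypothetical sketch of metamorphic-style probing: the same prompt is sent to a model twice, once with gendered terms swapped, and the two answers are collected for comparison. The `swap_gender` and `bias_probe` helpers and the `query_llm` callback are illustrative assumptions for this page only and do not reflect the actual ASTRAL or Meta-Fair APIs.

```python
# Minimal sketch of metamorphic-style bias probing for LLMs.
# Assumes a generic query_llm(prompt) -> str client; this is NOT the
# Meta-Fair or ASTRAL API, just an illustration of the idea.
import re

# One-directional mapping of gendered terms used to perturb the prompt.
GENDER_SWAP = {"he": "she", "him": "her", "his": "her",
               "man": "woman", "men": "women", "male": "female"}

def swap_gender(text: str) -> str:
    """Replace gendered terms with their counterparts (one direction only)."""
    def repl(match):
        word = match.group(0)
        swapped = GENDER_SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(GENDER_SWAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

def bias_probe(prompt: str, query_llm) -> dict:
    """Ask the same question twice, once with gendered terms swapped,
    and return both answers so an oracle (or a human) can compare them."""
    original = query_llm(prompt)
    perturbed = query_llm(swap_gender(prompt))
    return {"prompt": prompt, "original": original, "perturbed": perturbed}

# Example usage with a hypothetical query_llm function:
# report = bias_probe("Should we promote him to lead the engineering team?",
#                     query_llm)
# print(report["original"], report["perturbed"], sep="\n---\n")
```

In a full pipeline, an automated oracle or a human reviewer would flag prompt pairs whose answers differ in substance, for example recommending one variant for leadership and the other for a supportive role.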
Max Planck Institute for Security and Privacy (MPI-SP), Germany
Abstract: Ensuring software correctness is essential as software increasingly governs critical aspects of modern life. Formal methods for program verification, while powerful, often struggle with scalability when faced with the complexity of modern systems. Meanwhile, software testing (finding defects by executing the program) is practical but inherently incomplete, as it inevitably misses certain behaviors, i.e., the "unseens," leaving critical gaps in verification. In this tutorial, I will illuminate the transformative potential of statistical methods in addressing these challenges, with a particular focus on residual risk analysis. Residual risk analysis quantifies the likelihood of undiscovered bugs remaining in the software after testing by estimating the probability of finding a new, previously unseen bug in the next test input. We will begin by demonstrating how statistical estimators can assess residual risk using records from software testing, such as code coverage data, through a hands-on example. The tutorial then explores several advanced extensions that adapt residual risk analysis to more realistic testing scenarios. By the end of this session, participants will gain a deeper understanding of how statistical thinking can provide actionable insights into the unseen behaviors of software systems, ultimately making testing more accountable, transparent, and efficient.
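To make the hands-on part concrete, the sketch below shows one estimator widely used in this line of work, the Good-Turing estimate: the probability that the next test input exercises a previously unseen behavior is approximated by f1/n, the number of coverage elements observed in exactly one test divided by the number of tests run. The function name and the coverage-log format are illustrative assumptions rather than the tutorial's actual material.

```python
# A minimal sketch of a Good-Turing-style residual risk estimate computed
# from coverage logs. The input format (one set of covered branch IDs per
# executed test) is an assumption made for illustration.
from collections import Counter
from typing import Iterable, Set

def estimate_residual_risk(coverage_per_test: Iterable[Set[str]]) -> float:
    """Estimate the probability that the next test input exercises a
    previously unseen coverage element, a proxy for discovering a new bug.

    Good-Turing estimate: f1 / n, where f1 is the number of coverage
    elements seen in exactly one test and n is the number of tests run.
    """
    seen_in = Counter()  # coverage element -> number of tests that hit it
    n_tests = 0
    for covered in coverage_per_test:
        n_tests += 1
        for element in covered:
            seen_in[element] += 1
    if n_tests == 0:
        return 1.0  # no evidence yet, so assume maximal residual risk
    f1 = sum(1 for hits in seen_in.values() if hits == 1)  # singletons
    # Clamp to 1.0; the estimate only becomes meaningful after many tests.
    return min(1.0, f1 / n_tests)

# Example: three test executions over hypothetical branch IDs.
runs = [{"b1", "b2"}, {"b1", "b3"}, {"b1", "b2", "b4"}]
print(f"Estimated residual risk: {estimate_residual_risk(runs):.2f}")
```

Reading the number: a high estimate after only a few tests mainly signals that testing has barely started; as the singleton count stops growing relative to the number of tests, the estimate shrinks, giving a quantitative, transparent account of what may remain unseen.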