On November 30, 2022, ChatGPT was released to the general public and amassed over one million users in the first five days. Google and Meta raced to release their large language models too. The growth trend continued throughout this breakout year of AI; however so did new
data leaks and
security breaches. Sadly, security considerations seem to have not kept pace with this grand expansion, and there has been a lack of focus on protecting sensitive and personal information from exposure and misuse.
This is why we created
LLM Canary, an open-source security benchmark and test suitebased on the
OWASP LLM Top Ten. LLM Canary provides the AI community an accessible, trusted, and easy to use benchmarking tool to assess and report on the security posture of customized and fine-tuned LLMs. The testing suite can also be integrated into the AI development workflow for continuous vulnerability evaluation.
Foundation models are trained on vast amounts of data and are particularly susceptible to attack. Developers looking to incorporate LLMs into their organizations by training models with sensitive information can use this tool to better understand the potential vulnerabilities and security trade offs between different models.
The tests are designed to cover a variety of risk levels, and sophisticated attack techniques. LLMs are non-deterministic, and produce inconsistent responses. The LLM Canary takes into account duplicate prompts and repetition. The test suites can be expanded or customized, and testing can be integrated into development workflows.
In producing the initial benchmark, we ran multiple rounds of testing per LLM, to calculate the cumulative average as the overall LLM score. We ran 125 test runs per OWASP vulnerability group and per LLM, for the top 3 scoring LLMS. This amounted to almost 16k total tests.
We have shared some charts illustrating the results of the initial benchmark exercise. For instance, GPT-4 outperformed other LLMs in the benchmark for both vulnerability types tested. GPT-3.5 and Llama presented some more interesting results between the two test groups, both performing less consistently than the other LLMs tested.