Google unveils batch calibration to improve LLM performance

Google Research recently introduced a method called Batch Calibration (BC) aimed at improving the performance of large language models (LLMs) by reducing their sensitivity to design decisions such as template selection. By mitigating biases introduced by prompt templates, label spaces, and demonstration examples, the method is intended to curb unexpected performance degradation and support more robust LLM applications. It was unveiled on October 13, 2023, and described by Han Zhou, a student researcher, and Subhrajit Roy, a senior research scientist at Google Research.

The challenge

The performance of LLMs, especially in in-context learning (ICL) scenarios, is highly sensitive to the design choices made during development. These choices can bias a model's predictions and lead to unexpected performance degradation. Existing calibration methods attempt to address such biases, but the field has lacked a unified analysis distinguishing the advantages and disadvantages of each approach, as well as a method that can effectively reduce bias and recover LLM performance without additional computational cost.

Batch calibration solution

Drawing on their analysis of existing calibration methods, the research team proposed Batch Calibration as a solution. Unlike other methods, BC is zero-shot and self-adaptive (inference-only), and it incurs negligible additional cost. The method estimates contextual bias from a set of inputs and removes it, reducing bias and improving performance. According to the researchers, the critical component of successful calibration is an accurate estimate of the contextual bias. BC's approach to estimating this bias is distinctive: it relies on a linear decision boundary and marginalizes the model's output scores, in a content-based manner, across all samples within a batch.
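Conceptually, the calibration step is simple. The following Python sketch illustrates the idea under stated assumptions: per-class scores (for example, log-probabilities over the candidate labels) have already been collected from the model for a batch of unlabeled inputs; the function name, shapes, and sample data are illustrative rather than taken from the paper.

```python
import numpy as np

def batch_calibrate(scores: np.ndarray) -> np.ndarray:
    """Sketch of Batch Calibration (hypothetical helper, not the paper's code).

    scores: (M, K) array of per-class scores (e.g., log-probabilities)
            that the LLM assigned to K candidate labels for M batch inputs.
    Returns the scores with the estimated contextual bias removed.
    """
    # Estimate the contextual bias as the class-wise mean score over
    # all samples in the batch -- no labels are needed for this step.
    bias = scores.mean(axis=0, keepdims=True)  # shape (1, K)
    # Subtracting the bias shifts the linear decision boundary;
    # predictions are then the argmax of the calibrated scores.
    return scores - bias

# Illustrative usage: scores for 10 unlabeled inputs over 2 labels.
rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 2))  # stand-in for real model scores
preds = batch_calibrate(scores).argmax(axis=1)
```

Because the bias estimate is just a class-wise mean over scores the model already produces, the correction adds essentially no compute beyond the forward passes needed for inference.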


Validation and results

The effectiveness of BC was validated using PaLM 2 and CLIP models on more than ten natural language understanding and image classification tasks. The results were promising: BC outperformed existing calibration baselines, including contextual calibration and prototypical calibration, on all evaluated tasks, with an 8% and 6% performance improvement on the small and large variants of PaLM 2, respectively. This demonstrates its potential as a robust and cost-effective solution for improving LLM performance.

Impact on prompt engineering

One of the notable advantages of BC is its impact on prompt engineering. The method proved more robust to common prompt engineering design choices, making prompt engineering significantly easier while remaining data efficient. This robustness held even for unconventional choices, such as emoji pairs used as labels. BC's strong performance with approximately 10 unlabeled samples also demonstrates its sample efficiency compared with other methods that require more than 500 unlabeled samples for stable performance.

The batch calibration method is an important step toward addressing the challenges associated with the performance of large language models. By successfully reducing biases associated with design decisions and demonstrating significant performance improvements across tasks, BC holds promise for more robust and efficient LLM applications in the future.
