Abstract
We evaluate the ability of today's LLMs to perform latent multi-hop reasoning over factual knowledge, i.e., their latent composability: the ability to internally recall and compose facts learned separately during training, by assessing model completions of multi-hop queries such as "In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of". However, accurately measuring this ability is challenging without careful dataset construction and evaluation. For example, a model may have developed shortcuts from numerous encounters of the head entity "Scarlett Johansson" and the answer entity "United States" in the same training sequences, or it may merely guess the answer from a frequency-based prior without performing any multi-hop reasoning. To take this into account, we propose dataset and evaluation desiderata for shortcut-free evaluation of latent multi-hop reasoning, and construct an evaluation dataset named SOCRATES (ShOrtCut-fRee lATent rEaSoning) to measure latent composability as rigorously as possible without access to the actual pretraining data. Specifically, we select relation compositions and facts that minimize the chance of models exploiting shortcuts, and exclude test queries where any combination of aliases of the head and answer entities appears together in any document of multiple pretraining corpora, and, in some cases, in Google Search results; we demonstrate the importance of this filtering by contrasting our results with the misleading results obtained from a shortcut-prone version of the data. Experiments with the shortcut-free dataset reveal that even today's best models lack generalization in latent composability: the ability differs dramatically with the type of the bridge entity. Latent composability of the best models is only about 5% for test queries where the bridge entity is a year, but above 80% when the bridge entity is a country. Latent composability tends to increase roughly linearly with the number of known single-hop facts and tends to be higher for larger models, but the rate of increase and the gain from model scale also differ significantly with the type of the bridge entity. Comparisons with Chain-of-Thought composability highlight a significant gap between latent and explicit reasoning. Through initial investigations and connections to related work, we suggest several unlikely and several plausible hypotheses about the factors that may determine latent composability. In sum, our work provides a resource, insights, and potential future directions for the precise evaluation, understanding, and improvement of latent reasoning in LLMs.
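To make the shortcut filter described above concrete, the following is a minimal illustrative sketch, not the authors' actual pipeline: a two-hop test query is kept only if no alias of the head entity co-occurs with any alias of the answer entity in any document of the available corpora. The function names, query fields, and corpus variables are assumptions for illustration.

```python
# Illustrative sketch of shortcut-free filtering (assumed names, not the paper's code).
from typing import Iterable


def cooccurs_in_corpus(head_aliases: Iterable[str],
                       answer_aliases: Iterable[str],
                       documents: Iterable[str]) -> bool:
    """Return True if any head alias and any answer alias appear in the same document."""
    head = [a.lower() for a in head_aliases]
    answer = [a.lower() for a in answer_aliases]
    for doc in documents:
        text = doc.lower()
        if any(h in text for h in head) and any(a in text for a in answer):
            return True
    return False


def is_shortcut_free(query: dict, corpora: Iterable[Iterable[str]]) -> bool:
    """Keep a test query only if the head and answer entities never co-occur in any corpus."""
    return not any(
        cooccurs_in_corpus(query["head_aliases"], query["answer_aliases"], docs)
        for docs in corpora
    )


# Hypothetical usage with made-up corpus variables:
# query = {"head_aliases": ["Scarlett Johansson"],
#          "answer_aliases": ["United States", "USA", "U.S."]}
# keep = is_shortcut_free(query, [corpus_a_docs, corpus_b_docs])
```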
Authors
Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, Mor Geva
Venue
arXiv