Shwetha Somasundaram

I am currently working as a Research Associate II at the Multimodal Content Experiences Lab at Adobe Research. Over the last 2.5 years, I’ve primarily worked with Dr. Apoorv Saxena and Dr. Balaji Srinivasan on leveraging Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) for document experience projects for Adobe Acrobat and Adobe Express. I’ve worked on a wide range of research areas: retrieval and attribution for document question answering, document stylization and transformation, graphic design generation, and speculative decoding. I am currently interested in, and working on, model merging techniques for LLMs/VLMs and the use of model internals for interpretability.

I completed my bachelor’s thesis under the supervision of Prof. N Venkateswaran at SSN College of Engineering. My project focused on road object detection from radar sensor data using machine learning and deep learning object detection techniques. During my undergraduate studies, I also explored the estimation of tracer kinetic parameters from undersampled DCE-MRI data under the supervision of Dr. Phaneendra Yalavarthy at the Medical Imaging Lab, Indian Institute of Science, Bangalore.

If you’d like to know more about my work or discuss potential collaborations, please check out my CV. I’m always open to new opportunities and interesting conversations!

Publications

2024

NAACL Findings 2025

PLD+: Accelerating LLM inference by leveraging Language Model Artifacts

Shwetha Somasundaram, Anirudh Phukan, and Apoorv Saxena

arXiv preprint arXiv:2412.01447, 2024

To reduce the latency associated with autoregressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limits its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to further accelerate inference. We test our approach on five input-guided tasks, and through extensive experiments we find that PLD+ outperforms all tuning-free approaches. In the greedy setting, it even outperforms the state-of-the-art tuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31 in terms of average speedup). Our approach is tuning-free, does not require any additional compute, and can easily be used to accelerate inference of any LLM.

AAAI 2025

PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization

Vijay Jaisankar, Sambaran Bandyopadhyay, Kalp Vyas, and 2 more authors

arXiv preprint arXiv:2405.20213, 2024

A poster from a long input document can be considered a one-page, easy-to-read multimodal (text and images) summary presented on a visually appealing template with good design elements. Automatic transformation of a long document into a poster is a little-studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground-truth summaries to extract multimodal content from the document and which explicitly ensures good coverage, diversity, and alignment of text and images. Then, we use an LLM-based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.

ACL Findings 2024

Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering

Anirudh Phukan, Shwetha Somasundaram, Apoorv Saxena, and 2 more authors

In Findings of the Association for Computational Linguistics: ACL 2024, 2024

With advances in the field of generative artificial intelligence (AI), contextual question answering has become extremely relevant. Attributing model generations to the input source document is essential to ensure trustworthiness and reliability. We observe that when large language models (LLMs) are used for contextual question answering, the output answer often consists of text copied verbatim from the input prompt, linked together with “glue text” generated by the LLM. Motivated by this, we propose that LLMs have an inherent awareness of where the text was copied from, likely captured in the hidden states of the LLM. We introduce a novel method for attribution in contextual question answering that leverages the hidden state representations of LLMs. Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers. Our experimental results demonstrate that our method performs on par with or better than GPT-4 at identifying verbatim copied segments in LLM generations and at attributing these segments to their source. Importantly, our method shows robust performance across various LLM architectures, highlighting its broad applicability. Additionally, we present Verifiability-granular, an attribution dataset with token-level annotations for LLM generations in the contextual question answering setup.

EACL Main 2024

Presentations by the Humans and For the Humans: Harnessing LLMs for Generating Persona-Aware Slides from Documents

Ishani Mondal, Shwetha S, Anandhavelu Natarajan, and 3 more authors

In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Mar 2024

Scientific papers and slides are two different representations of the same underlying information, but both require substantial work to prepare. While there have been prior efforts on automating document-to-slides generation, there is still a pressing need to customize the presentation of content to align with the persona of the target audience or the duration of the presentation. This paper first introduces the concept of end-user specification-aware document-to-slides conversion, which incorporates end-user specifications into the conversion process. For this, we introduce a new dataset that reuses the existing SciDuet dataset, consisting of pairs of papers and corresponding slide decks from recent years’ *ACL conferences, to create four persona-aware configurations. Secondly, we present Persona-Aware-D2S, a novel approach that finetunes LLMs using target audience feedback to create persona-aware slides from scientific documents. Our evaluation on both automated metrics and qualitative human evaluation suggests that by incorporating end-user specifications into the conversion process, our model can create presentations that are not only informative but also tailored to the expectations and cognitive abilities of the target audience.

2023

EMNLP Findings 2023

Drilling Down into the Discourse Structure with LLMs for Long Document Question Answering

Inderjeet Nair*, Shwetha Somasundaram*, Apoorv Saxena, and 1 more author

In Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023

We address the task of evidence retrieval for long document question answering, which involves locating relevant paragraphs within a document to answer a question. We aim to assess the applicability of large language models (LLMs) to the task of zero-shot long document evidence retrieval, owing to their unprecedented performance across various NLP tasks. However, LLMs can currently consume only limited context lengths as input, so providing document chunks as inputs might overlook the global context and miss inter-segment dependencies. Moreover, directly feeding large inputs can incur significant computational costs, particularly when processing the entire document (and potentially monetary expenses with enterprise APIs like OpenAI’s GPT variants). To address these challenges, we propose a suite of techniques that exploit the discourse structure commonly found in documents. By utilizing this structure, we create a condensed representation of the document, enabling a more comprehensive understanding and analysis of relationships between different parts. We retain 99.6% of the best zero-shot approach’s performance while processing only 26% of the total tokens used by the best approach in the information-seeking evidence retrieval setup. We also show how our approach can be combined with a self-ask reasoning agent to achieve the best zero-shot performance in complex multi-hop question answering, just ≈4% short of zero-shot performance using gold evidence.

Patents

PLD+: Accelerating LLM inference by leveraging the hidden states of Language Models (US Patent App. 18/924,398)
Evidence Retrieval for Long Document Question Answering Using Large Language Models (US Patent App. 18/508,437)
Automatic generation of handouts from multi-modal documents (US Patent App. 18/542,161)
Merging misidentified text structures in a document (US Patent App. 18/511,111)
Generating targeted layouts from source documents utilizing large language models with semantic hierarchical transformations (US Patent App. 18/809,147)
Generating a digital poster including multimodal content extracted from a source document
Document-based presentation generation (US Patent App. 18/675,451)