With large language models growing larger than ever, I believe architectural efficiency should be a constant consideration. I am interested in building parameter-efficient, high-throughput architectures [1,2]. In earlier work, we showed that even existing architectures like the Transformer can be made to converge significantly faster during pre-training by guiding their attention patterns.
Zero- and few-shot learning
Large language models should not only use efficient architectures but also be data-efficient during fine-tuning and inference. In a series of two papers [1,2], we proposed a paradigm called semantic supervision, in which classes are represented not just symbolically but with a large number of semantic descriptions, and showed that this improves generalization to classes not seen during training. Our method can help generalize to millions of classes in the extreme classification setting.
Democratizing NLP involves improving language technologies for all languages, not just English. I am interested in building better multilingual models, which requires understanding the shortcomings of the current SOTA. In a series of two papers [1,2], we showed that while multilingual pre-training objectives achieve impressive zero-shot cross-lingual transfer, they fail between very simple languages that differ only in their word order. We hope this can motivate improvements in this line of work.
I am particularly interested in how to use RL for NLP, and vice versa. We showed that using sentiment analysis to predict the sentiment of text observations can be useful for reward shaping. Previously, I have worked on inferring options from offline expert trajectories and improving hindsight learning.