James Clarke & Research

My research focuses on building computational systems capable of extracting, processing and understanding information. My approaches are grounded in Natural Language Processing and Machine Learning. Some of the problems I have worked on include: interpreting and understanding human language in the context of a world model; performing task-specific relation extraction and database population; and automatic summarization. I have extensive experience developing novel end-to-end Natural Language Processing systems and scalable pipelines.

I have been interested in leveraging context expressed through a world model to help natural language interpretation and understanding. Another part of my research involves using integer linear programming and other methods to create more global models for natural language processing problems. This is closely related to my PhD research, in which I focused on developing methods to process, extract and summarise information from large natural text collections. In particular, I formalised the compression task within an integer linear programming framework, which allowed new and existing models to be supplemented with linguistic constraints.
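As a rough illustration of what casting compression as an integer linear program means, the sketch below treats each word as a 0/1 keep-or-drop variable, maximises a linear score, and enforces two linear constraints. It is a toy under stated assumptions, not the model from the thesis or papers: the function name `compress`, the per-word scores, the head indices and the minimum length are all hypothetical, and exhaustive enumeration stands in for a real ILP solver.

```python
from itertools import product


def compress(words, scores, heads, min_len):
    """Keep the highest-scoring subset of words that satisfies the constraints.

    words   : list of tokens
    scores  : hypothetical per-word scores (higher means more worth keeping)
    heads   : heads[i] is the index of word i's syntactic head, or None
    min_len : minimum number of words the compression must keep
    """
    best, best_score = None, float("-inf")
    # Enumerate every 0/1 assignment; a real system would call an ILP solver.
    for y in product((0, 1), repeat=len(words)):
        # Length constraint: sum_i y_i >= min_len
        if sum(y) < min_len:
            continue
        # Illustrative linguistic constraint: y_i - y_head(i) <= 0,
        # i.e. a word may only be kept if its head is kept too.
        if any(y[i] - y[h] > 0 for i, h in enumerate(heads) if h is not None):
            continue
        # Linear objective: sum_i scores[i] * y_i
        score = sum(s * keep for s, keep in zip(scores, y))
        if score > best_score:
            best, best_score = y, score
    if best is None:  # no assignment satisfies the constraints
        return []
    return [w for w, keep in zip(words, best) if keep]


words = ["the", "old", "dog", "barked", "loudly"]
scores = [0.1, -0.5, 1.0, 1.2, -0.3]   # made-up scores
heads = [2, 2, 3, None, 3]             # made-up dependency heads
print(compress(words, scores, heads, min_len=3))
```

Because the constraints are stated separately from the scoring function, further linguistic requirements can be added as extra linear inequalities without changing how the scores are computed.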

On the development side, I am interested in the architecture of complex natural language processing pipelines. This interest grew at The University of Illinois, where I designed and developed The Curator, a common NLP platform that provides a programmatic interface to perform, retrieve and store annotations of texts. Designed using Service Oriented Architecture principles, it exposes its end-points as software as a service. This reduces the amount of duplicate data processing and removes the burden of data processing, storage and maintenance from the user (researcher). The system is open-source and still widely used by the Cognitive Computation Group today.
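The idea of performing an annotation once and serving stored results afterwards can be sketched in a few lines. This is only a toy under stated assumptions and does not reflect The Curator's actual interface: the class `AnnotationStore`, its methods, and the "tokens" view are hypothetical names, and an in-memory dictionary stands in for a shared service and its storage.

```python
import hashlib


class AnnotationStore:
    """Toy stand-in for a shared annotation service (all names hypothetical)."""

    def __init__(self):
        self._annotators = {}  # view name -> function(text) -> annotation
        self._cache = {}       # (text digest, view name) -> stored annotation
        self.calls = 0         # number of times an annotator actually ran

    def register(self, view, annotator):
        """Make an annotator available under a view name."""
        self._annotators[view] = annotator

    def annotate(self, text, view):
        """Return the stored annotation, computing it only on first request."""
        key = (hashlib.sha256(text.encode("utf-8")).hexdigest(), view)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._annotators[view](text)
        return self._cache[key]


store = AnnotationStore()
store.register("tokens", str.split)  # whitespace tokeniser as a placeholder
first = store.annotate("The dog barked", "tokens")
second = store.annotate("The dog barked", "tokens")  # served from the cache
```

The second request returns the stored result without running the annotator again, which is the duplicate processing a shared platform is meant to avoid.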

The compression corpora from my sentence-compression experiments are available on the resources page.

Publications

PhD Thesis

James Clarke. 2008. Global Inference for Sentence Compression: An Integer Linear Programming Approach. PhD Thesis, University of Edinburgh.
Full Details  ∞  pdf

Journal Papers

James Clarke and Mirella Lapata. 2010. Discourse Constraints for Document Compression. In Computational Linguistics, vol. 36(3), pages 411–441.
Full Details  ∞  pdf

James Clarke and Mirella Lapata. 2008. Global Inference for Sentence Compression: An Integer Linear Programming Approach. In Journal of Artificial Intelligence Research, vol. 31, pages 399–429.
Full Details  ∞  pdf

Conference Papers

James Clarke, Vivek Srikumar, Mark Sammons, and Dan Roth. 2012. An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines). In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3267–3283. Istanbul, Turkey.
Full Details  ∞  pdf

Dan Goldwasser, Roi Reichart, James Clarke and Dan Roth. 2011. Confidence Driven Unsupervised Semantic Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), pages 1486–1495. Portland, Oregon.
Full Details  ∞  pdf

James Clarke, Dan Goldwasser, Ming-Wei Chang and Dan Roth. 2010. Driving Semantic Parsing from the World's Response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL-2010), pages 18–27. Uppsala, Sweden.
Full Details  ∞  pdf  ∞  slides

Jacob Eisenstein, James Clarke, Dan Goldwasser and Dan Roth. 2009. Reading to Learn: Constructing Features from Semantic Abstracts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 958–967. Singapore.
Full Details  ∞  pdf

Sebastian Riedel and James Clarke. 2009. Revisiting Optimal Decoding for Machine Translation IBM Model 4. In Proceedings of the NAACL HLT 2009 Short Papers, pages 5–8. Boulder, Colorado.
Full Details  ∞  pdf

James Clarke and Mirella Lapata. 2007. Modelling Compression with Discourse Constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and on Computational Natural Language Learning, pages 1–11. Prague, Czech Republic.
Full Details  ∞  pdf  ∞  slides  ∞  Received the Best Paper Award EMNLP-CoNLL 2007

Sebastian Riedel and James Clarke. 2006. Incremental Integer Linear Programming for Non-projective Dependency Parsing. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 129–137. Sydney, Australia.
Full Details  ∞  pdf  ∞  slides

James Clarke and Mirella Lapata. 2006. Constraint-Based Sentence Compression: An Integer Programming Approach. In Proceedings of the COLING/ACL 2006 Main Conference Poster Session, pages 144–151. Sydney, Australia.
Full Details  ∞  pdf  ∞  slides

James Clarke and Mirella Lapata. 2006. Models for Sentence Compression: A Comparison across Domains, Training Requirements and Evaluation Measures. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 377–384. Sydney, Australia.
Full Details  ∞  pdf  ∞  slides

People

Other people you should really be paying attention to:

Interact

I co-organised the Workshop on Integer Linear Programming for Natural Language Processing hosted at NAACL HLT 2009. During this time we created a resource for people interested in Integer Linear Programming for Natural Language Processing, although it is now mostly abandoned. More information about the workshop can be found on the ILP for NLP wiki.

You can help our community by participating in our Language Experiments.

More on the threat of NLP and global warming can be found via Jason Eisner.

Previously

A long time ago I designed a Multicast rsync and a tongue-in-cheek productivity hack.