About SimSeerX

SimSeerX is a similar document search engine. It accepts a document as input and then uses several similarity functions to identify similar documents and rank them.
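The general idea of scoring a collection against an input document and ranking the results can be illustrated with a minimal sketch. It is not SimSeerX's implementation: it assumes one particular similarity function, the Jaccard coefficient over word shingles in the spirit of Broder et al. (1997), and the names word_shingles, jaccard and rank_similar are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def word_shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one shingle, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard coefficient |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query: str, collection: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Score every document against the query and sort by descending similarity."""
    query_shingles = word_shingles(query, k)
    scored = [
        (doc_id, jaccard(query_shingles, word_shingles(text, k)))
        for doc_id, text in collection.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection, for illustration only.
collection = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "completely unrelated text about search engines",
    "c": "the quick brown fox sleeps",
}
ranking = rank_similar("the quick brown fox jumps over the lazy dog", collection)
```

A production system would not compare the query against every document in this way; an index such as Solr/Lucene retrieves candidates first. The sketch only shows how a similarity function turns into a ranking.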

Similarity in SimSeerX

SimSeerX currently supports 3 notions of similarity:

Document Collections

SimSeerX currently indexes the following document collections:

More collections coming soon!

Software

SimSeerX is built on the Play! Framework and uses Solr/Lucene with custom similarity functions for indexing and searching. Information extraction is performed using CiteSeerExtractor (Williams et al., 2014), and keyphrase extraction is performed using Maui.

API

The SimSeerX API is described at http://simseerx.ist.psu.edu/api.

People

Kyle Williams
Kyle Williams developed, runs and maintains SimSeerX as part of his PhD research.
Prof. C. Lee Giles
Prof. C. Lee Giles is the PI on the SimSeerX project.

Publications

A paper describing SimSeerX appeared in ACM Document Engineering 2014.

Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: a similar document search engine. In Proceedings of the 2014 ACM symposium on Document engineering (pp. 143-146). ACM.

References

Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8), 1157-1166.

Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.

Manku, G. S., Jain, A., & Das Sarma, A. (2007). Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (pp. 141-150). ACM.

Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 (pp. 1318-1327). Association for Computational Linguistics.

Williams, K., Li, L., Khabsa, M., Wu, J., Shih, P. C., & Giles, C. L. (2014). A web service for scholarly big data information extraction. In 2014 IEEE International Conference on Web Services (ICWS) (pp. 105-112). IEEE.