Informatics team finds simple rules that explain universal laws of written text
FOR IMMEDIATE RELEASE
May 20, 2009
BLOOMINGTON, Ind. -- Two Indiana University School of Informatics professors have written a paper explaining a model they have developed that could lead to improved techniques for identifying key terms that capture the topics of a Web page.
Professors Alessandro Flammini and Filippo Menczer, along with M. Ángeles Serrano from the University of Barcelona, have authored a paper, entitled "Modeling Statistical Properties of Written Text," that explains the model. The paper appears in the open-access online journal PLoS One and can be viewed at https://www.plosone.org/article/info:doi/10.1371/journal.pone.0005372.
The paper details the trio's introduction and validation of a generative model that uses simple rules to explain the simultaneous emergence of several patterns of written text observed in many languages. The paper focuses on the well-known Zipf's law of word frequencies, as well as additional patterns such as Heaps' law of word diversity, the bursty nature of rare words, and similarity among documents.
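For readers unfamiliar with the two laws named above, the short Python sketch below shows the quantities they describe. Zipf's law concerns how a word's frequency falls off with its frequency rank, and Heaps' law concerns how the number of distinct words grows as a text gets longer. This is only an illustration of those measurements, not the researchers' generative model; the function names and the sample sentence are invented for the example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first.

    Zipf's law describes how count falls as rank rises.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token.

    Heaps' law describes how this number grows with text length.
    """
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Hypothetical toy text; real studies use large document collections.
sample = "the cat sat on the mat and the dog sat on the rug".split()
table = rank_frequency(sample)
growth = vocabulary_growth(sample)

for rank, word, count in table:
    print(rank, word, count)
print(growth)
```

On a text this small the patterns are only suggestive; the regularities the paper models appear in large collections of documents.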
Through their model, the researchers found a connection between word burstiness and the topicality of text. In addition, they identified dynamic word ranking and memory across documents as key mechanisms that explain the organization of written text.
This research could have broad implications and practical applications in computer science, cognitive science and linguistics. For example, all search engines are based on the analysis of text. The model and the paper's findings could lead to improved techniques for identifying key terms that capture the topics of a Web page, which is crucial for matching search queries to relevant results.
The semantic similarity between topics -- which is one of the features that the model aims to explain -- is visualized by a similarity cloud developed by computer science graduate student Mark Meiss. For example, the similarity cloud in the figure illustrates the topical relationships between 'mac' and 'pc.'
"Our paper hopefully will spur further research in this area," said Menczer. "In the end, this model could lay the groundwork that will help us improve a wide range of applications that are based on the analysis of the written word, such as search engines, contextual online advertising and topic detection."
"This is the type of research that we are so proud to be doing here at the School of Informatics," said Geoffrey Fox, chair of informatics. "To be conducting interdisciplinary practical studies that could impact an entire field and lead to development of ground-breaking applications -- that's our goal."
About the IU School of Informatics
Founded in 2000 as the first school of its kind in the United States, the Indiana University School of Informatics is dedicated to research and teaching across a broad range of computing and information technology, with emphases on science, applications and societal implications. The school includes computer science and informatics on the Bloomington campus and informatics on the IUPUI campus. The school administers a variety of bachelor's and master's degree programs in computer science and informatics, as well as doctoral programs in computer science and the first-ever Ph.D. in informatics. The school is dedicated to excellence in education and research, to partnerships that bolster economic development and entrepreneurship, and to increasing opportunities for women and underrepresented minorities in computing and technology. For more information, visit www.informatics.indiana.edu.
To speak with Flammini or Menczer, please contact Lisa Herrmann at 812-855-5125 or firstname.lastname@example.org.