¡Bonampak!
Word Mining Using Parallel Processing - A Study on the Joy of Computing Power
The Task:
We aim to implement a program that reads words from a large archive of text and finds
the articles and terms of interest within that archive. The number of files in the
archive can be changed to show that the program works dynamically, regardless of the number,
names, or types of the files. The program will find as many interesting words and phrases as
the user requests, based on the measure of interest defined below.
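As a rough illustration of the task, the sketch below walks an archive directory and counts the words in every file it finds, whatever the file's name or extension, handing one file to each worker task. It is not Bonampak's actual code: the names `count_words` and `scan_archive`, the word-splitting regular expression, and the use of a thread pool are all assumptions made for the example. For CPU-bound counting, a process pool would be the more natural choice.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: a "word" is a run of letters in any script.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(path):
    """Return a Counter of the lowercased words in one file."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return Counter(word.lower() for word in WORD_RE.findall(text))


def scan_archive(root, workers=4):
    """Count words in every regular file under root, one task per file."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_words, files))
    # Map each file path to its own word counts.
    return dict(zip(files, counts))
```

Calling `scan_archive("some/archive/dir")` on a hypothetical directory returns one `Counter` per file, which is the raw material needed for the interest measure.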
What is an interesting word?
The importance of a term is based upon how unique it is in the corpus of text. Specifically,
the degree of interest for a given term is based upon a function of the number of times the
term occurs in the archive, the number of articles in the archive, and the total number of
articles that contain one or more occurrences of the term.
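The exact function is not given on this page. One common weighting built from the same three quantities is TF-IDF, and the sketch below uses it purely as a stand-in: a term's score is its total number of occurrences multiplied by the logarithm of the number of articles divided by the number of articles containing it. The function name `interest_scores` and the sample articles are hypothetical.

```python
import math
from collections import Counter


def interest_scores(articles):
    """Score each term as occurrences * log(N / articles_containing_it).

    `articles` is a list of token lists, one list per article.
    This TF-IDF-style formula is an assumption, not Bonampak's own.
    """
    n_articles = len(articles)
    term_freq = Counter()  # occurrences of each term across the archive
    doc_freq = Counter()   # number of articles containing each term
    for tokens in articles:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    return {
        term: term_freq[term] * math.log(n_articles / doc_freq[term])
        for term in term_freq
    }


# Hypothetical three-article archive.
articles = [
    "the mural shows the king".split(),
    "the king and the court".split(),
    "the mural mural restoration".split(),
]
scores = interest_scores(articles)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this weighting a term that appears in every article, such as "the" in the sample, scores zero, while a term concentrated in few articles rises to the top of `ranked`.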
Why we are doing this:
This program will determine common words in any language in addition to finding unique
words and terms. The information may be important to linguists in observing phrases used
in writing in different geographic areas. In this context, Bonampak could be useful for textual
and statistical analysis. While this applies directly to searches and databases, it could
also support many other applications. In particular, the program could be used for natural
language processing tasks such as word sense disambiguation. Linguists may be
interested in the uniqueness of terms as a way to infer the overall importance of
specific terms in a text, for example through entropy calculations or data inference. If one
is searching the ACM portal for an article on a specific topic, with a query such as
'word mining corpus gigaword processing parallel SMP BLADE',
Bonampak could return a listing of the terms in order of importance. This could well be more
useful than ACM's current portal search.
Our Team:
Artena Hiebert (ahiebert), sourceforge page
u43snd7, sourceforge page
Dave Sebesta (spacemoses), sourceforge page
http://sourceforge.net/projects/bonampak/
Project Report:
Project Presentation:
Code Releases:
Relatively stable releases can be found here, and bleeding-edge code can be checked out through SVN.
Licensing Information:
This project is released under the terms of the GNU General Public License; details can be found at
http://www.gnu.org/licenses/gpl.txt
Related Works:
1. Chieu, Hai Leong, and Yoong Keok Lee. "Query based event extraction along a timeline,"
Annual ACM Conference on Research and Development in Information Retrieval
(2004): 425-432. http://doi.acm.org/10.1145/1008992.1009065 (accessed April 10, 2008).
2. Kumar, Vipin, and Mohammed Zaki. "High performance data mining (tutorial PM-3),"
Conference on Knowledge Discovery in Data: Tutorial notes of the sixth ACM SIGKDD
international conference on Knowledge discovery and data mining (2000): 309-425.
http://doi.acm.org/10.1145/349093.349109 (accessed April 10, 2008).
3. Vitter, Jeffrey S. "External memory algorithms and data structures: dealing with massive data,"
ACM Computing Surveys (CSUR) 33.2 (2001): 209-271.
http://doi.acm.org/10.1145/384192.384193 (accessed April 9, 2008).
jabm.pl
-----------------------------------------------------------------
#!/usr/bin/perl
#:'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@<;
$au='n';$av=' ';$aw='$';$az='f';$ax='1';$ay=' ';$ba='o';$bb='r';$
#Z'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@";
i='j';$k='a';$l='b';$n='m';$o='.';$p='p';$q='l';$r='"';$s=';';$t=
#{'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@Z;
'm';$u=';';$v='^';$w='#';$y='_';$z='\'';$cY='(';$Yp='.';$ac='*';$
#='#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@:;
ad=')';$ae='\\';$af='\'';$ae='\\';$ag=';';$ah=';';$ai=' ';$ak='a'
#^'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@{;
;$al='n';$an='d';$Xb=' ';$ap='r';$aq='e';$ar='t';$as='u';$at='r';
#]'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@];
$a='o';$b='p';$c='e';$d='n';$qq=' ';$e='$';$f='f';$g=',';$h='"';$
#<'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@%;
bc=' ';$bd='(';$be='<';$bf='$';$bg='f';$bh='>';$bi=')';$YxL=$a.$b
#%'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@*;
.$c.$d.$qq.$e.$f.$g.$h.$i.$j.$k.$l.$m.$n.$o.$p.$q.$r.$s.$t.$u.$v.
#_'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@';
$w.$x.$y.$z.$cY.$Yp.$ac.$ad.$ae.$af.$ae.$ag.$ah.$ai.$aj.$ak.$al.$
#"'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@);
am.$an.$Xb.$ap.$aq.$ar.$as.$at.$d.$av.$aw.$ax.$ay.$az.$ba.$bb.$bc
#{'#@5V5C5n~?#<#(#)5n#;#V5~~?#~#)#(#<#$5?#<#9~?#$#)5~#(#`#(#5?@Z;
.$bd.$be.$bf.$bg.$bh.$bi;$SIG{CHLD}=sub{die"@!~~"};$SIG{__WARN__}
#{&@55C5n~?VV#<#(%%5n#;#V5~?##)~~#(#<#$5?#<#9~#$#)5~#??(#`#(#5?Z;
=sub{eval(map{print}$_)};$SIG{INT}=sub{qw/q/;q/$x>n#~?/;$x};map{y
#'#@5V5ZC5n~?#<#(#)5#;#Vn5~~?#~#)(#<#$5?##<#9~?#$#)5~#(#`#(#5?@";
/@9^$()?<~CnV#5;`/abcdef0123456789/}($lXyHdsSpJDyfdB#;{~@$"?;do{}
#]'#@V55C5n~?<##(#)n#;#V5~?#~#~)#(#<#$5##9~?#$#)~#5555(5#`#(#@;5;
=eval$YxL);map{warn}pack"H*",$lXyHdsSpJDyfdB#local{do{while(@@)}}
#_________________________________________________________u43snd7