Collocation search & Mutual Information/Technical Term Extraction Criteria(Keyness)

<H1>Collocation Extract</H1>
<H1><A href="http://pioneer.chula.ac.th/~awirote/colloc/"><FONT color=#193da9><SPAN style="FONT-SIZE: 10pt">http://pioneer.chula.ac.th/~awirote/colloc/</SPAN></FONT></A></H1>
<P><SPAN style="FONT-SIZE: 12pt; FONT-FAMILY: Times New Roman">Collocation Extract is designed to provide a list of collocations in a corpus. Users can search for the collocates of a particular word within a span of 2 to 5 words, or search for all two-word collocations.</SPAN> Three <A href="http://pioneer.chula.ac.th/~awirote/colloc/statmethod1.htm"><U><FONT color=#0000ff>statistical methods</FONT></U></A>, namely Dunning's log likelihood, (pointwise) mutual information, and chi-square, are used in this software.<SPAN style="FONT-SIZE: 12pt; FONT-FAMILY: Times New Roman"> The steps for using the program are given under How To Use.</SPAN></P>
<H1>Statistical Methods</H1>
<P>Five statistical methods for identifying collocations are available in this program: Dunning's log likelihood, (pointwise) mutual information, chi-square, the cubic association ratio (MI3), and the Fager and McGowan coefficient. Only the first three are described here. (See Manning and Schütze 1999 and Oakes 1998 for more details.)</P>
<H2>1. 
Dunning's log likelihood</H2>
<P>The test compares two hypotheses about W1 and W2: <BR>Hypothesis 1: P(W2|W1) = p = P(W2|not W1) <BR>Hypothesis 2: P(W2|W1) = p1 ≠ p2 = P(W2|not W1) <BR>Assuming a binomial distribution, the log likelihood ratio is calculated as follows: <BR><IMG src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fc3-ll.gif" width=342 height=97> <BR>Here c1 is the frequency of W1, c2 is the frequency of W2, c12 is the frequency of the bigram W1-W2, N is the total number of words in the corpus, p = c2/N, p1 = c12/c1, and p2 = (c2 – c12) / (N – c1). <BR>To test statistical significance, multiply the result by -2 and compare it with the chi-square table at one degree of freedom. The higher the resulting value, the more likely it is that W2 is a collocate of W1.</P>
<P><A href="http://ucrel.lancs.ac.uk/llwizard.html"><FONT color=#0900ff>http://ucrel.lancs.ac.uk/llwizard.html</FONT></A></P>
<H2>2. Pearson's Chi-square</H2>
<P>The association between W1 and W2 is calculated from the bigram frequencies in the four cells of the table below. "W1 - W2" is the number of times the bigram W1 - W2 occurs in the corpus. "not W1 - W2" is the number of times W2 is preceded by a word other than W1. "W1 - not W2" is the number of times W1 is followed by a word other than W2. "not W1 - not W2" is the number of remaining bigrams. <BR> <TABLE cols=2 width="50%" border=1> <TBODY> <TR> <TD>W1 - W2</TD> <TD>not W1 - W2</TD></TR> <TR> <TD>W1 - not W2</TD> <TD>not W1 - not W2</TD></TR></TBODY></TABLE></P>
<P>Chi-square is calculated using the formula below, where O<SUB>i,j</SUB> is the observed frequency in cell (i, j) of the table, and E<SUB>i,j</SUB> is the frequency expected in that cell if W1 and W2 co-occurred only by chance. 
The expected frequency of each cell is equal to (row total * column total) / grand total. <BR><IMG src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fc3-chi.gif" width=141 height=49> <BR>The higher the value, the more significant the association between W1 and W2.</P>
<P><A href="http://stattrek.com/online-calculator/chi-square.aspx"><FONT color=#0900ff>http://stattrek.com/online-calculator/chi-square.aspx</FONT></A></P>
<H2>3. Mutual Information</H2>
<P>This method compares the probability of finding the two words together with the probability expected if the two words were independent of each other. If x' and y' form a collocation, P(x' y') is likely to be much greater than P(x') * P(y'). Thus, the higher the value, the more likely it is that the two words form a collocation. <BR><IMG src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fc3-mi.gif" width=172 height=44> <BR>However, mutual information tends to give high values to rare events. For example, when W1 and W2 always occur together, MI is higher the lower their frequency is.</P>
<P>For 3-, 4-, and 5-word sequences, the statistical values are estimated using the pseudo-bigram transformation (Silva and Lopes, 1999).</P>
<H3>References</H3>
<UL>
<LI>Manning, Christopher and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.</LI>
<LI>Oakes, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.</LI>
<LI>Silva, Joaquim Ferreira da, and Gabriel Pereira Lopes. 1999. A Local Maxima Method and a Fair Dispersion Normalization for Extracting Multi-word Units from Corpora. In Proceedings of the 6th Meeting on the Mathematics of Language, Orlando, July 23-25.</LI></UL>
<H2>How To Use</H2>
<P>1. First, select all files in the corpus (“File – New File List”). 
These files can be either plain text or annotated files, such as HTML, SGML, or XML files. Select the file type that matches the data. Once the file list is defined, users can save it for future use by clicking “File-Save File List”.</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage002.jpg" width=554 height=390></P>
<P>2. Select one of the three statistical methods (log-likelihood, mutual information, or chi-square), or select “Raw Frequency” to see only the frequency of occurrence. The default is log-likelihood.</P>
<P>3. Select a span from 2 to 5. The number is the length, in words, of the collocations to look for. For example, if “2 words” is selected, the program searches for two-word collocations; if “3 words” is selected, it searches for three-word collocations.</P>
<P>4. Set the options. First, set the search direction by selecting “Option-Search Collocates on”. If “Left Side” is selected, the program looks for collocates that occur before the keyword. The default is “Both Sides”. Next, select the minimum frequency of n-word collocations; the program will then report only collocations that occur at least N times, where N is the number specified. Then select the statistical significance level: “p > .005”, “p > .05”, or “all occurrences”. Set the maximum number of collocations to be extracted; the default is 500 items. When searching for 2-word collocations, users can also specify the distance between the two words. If it is set to “2”, the two words are separated by one word. This option is provided because the words of a collocation are sometimes separated by other words, as in “hold (oppositional) views” or “hold (a similar) view”. The last option, “Ignore Header Tag”, is selected by default. 
It instructs the program to ignore everything inside the header element &lt;Header&gt; … &lt;/Header&gt; of SGML and XML files. (The header usually holds bibliographic and encoding information rather than content.)</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage004.jpg" width=554 height=390><BR></P>
<P>5. Specify the search. There are two ways to do this. The first is to specify a keyword (“Search – Keyword”). The second is to search for all 2-word collocations (“Search - All 2-word Collocations”). When searching for all 2-word collocations, users can specify the distance between the two words, as explained in step 4.</P>
<P>6. When the search is finished, the collocation window displays the list of collocates that co-occur with the keyword, sorted by significance. If two-word collocations are searched and the distance between the two words is greater than one, underscore symbols, “_”, are placed between the two words to indicate the distance. Users can save the collocation list by selecting “File-Save Colloc List”. The output is saved as a tab-delimited text file, which can then be imported into Excel for further use.</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage006.jpg" width=554 height=390></P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage008.jpg" width=554 height=390></P>
<P>7. To see the contexts of a particular word, click on that word and select “Concord” from the menu bar. The specified word and its collocate will be shown. Use “*” to stand for any word. 
Select the order of occurrences and the number of characters of left and right context.</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage010.jpg" width=258 height=125></P>
<P>8. The concordance output is shown in three columns. Users can save the result by selecting “File - Save Concordance”. This feature is provided only for a quick look at the contexts; users who want to work extensively with concordance results should use dedicated concordance software.</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage012.jpg" width=553 height=299></P>
<P>9. The program can also list n-word sequences from the corpus together with their frequencies. Click on the “n-word Frequency” menu, and specify the number of words in the sequence and the minimum frequency. For example, if n-word is set to “3” and “Min Freq” to “5”, the program lists all three-word sequences that occur at least 5 times in the corpus. The result of this search is not shown on the screen; it is saved as a text file.</P>
<P><IMG border=0 src="https://img1.daumcdn.net/relay/cafe/original/?fname=http%3A%2F%2Fpioneer.chula.ac.th%2F%7Eawirote%2Fcolloc%2Fimage013.gif" width=378 height=97></P>
<HR width="100%">
<H2>Download</H2>
<P>Click <A href="http://pioneer.chula.ac.th/~awirote/colloc/setupColloc.exe"><U><FONT color=#0000ff>here</FONT></U></A> to download Collocation Extract for Windows systems. The current version is 3.06.<BR></P>
<HR width="100%">
<P>This program was developed from Collocation Test, which can handle only two-word collocations. 
The development of Collocation Test was sponsored by the Development Grants for New Faculty/Researchers in 1999, and a research grant from the Research Division of the Faculty of Arts in 2000. </P> <P>Copyright 2000-4. <A href="http://pioneer.chula.ac.th/~awirote/wire/"><U><FONT color=#0000ff>Wirote Aroonmanakun</FONT></U></A>. <BR><A href="http://www.arts.chula.ac.th/~ling/"><U><FONT color=#0000ff>Dept. of Linguistics</FONT></U></A>, <A href="http://www.chula.ac.th/"><U><FONT color=#0000ff>Chulalongkorn University</FONT></U></A> </P>
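<P>The three measures described under Statistical Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used by Collocation Extract. The function names are hypothetical, the counts in the usage lines are invented, the number of bigrams is approximated by N, the log likelihood ratio uses the binomial form given in Manning and Schütze (1999), and base-2 logarithms are assumed for mutual information.</P>

```python
import math


def _xlogy(k, x):
    """Return k * log(x), treating 0 * log(0) as 0."""
    return 0.0 if k == 0 else k * math.log(x)


def _log_binom(k, n, x):
    """Log of x^k * (1 - x)^(n - k): the binomial likelihood without its constant."""
    return _xlogy(k, x) + _xlogy(n - k, 1.0 - x)


def log_likelihood(c1, c2, c12, n):
    """Return -2 * log(lambda) for the bigram W1-W2 (assumed Manning and Schutze form)."""
    p = c2 / n                    # P(W2) under Hypothesis 1
    p1 = c12 / c1                 # P(W2 | W1) under Hypothesis 2
    p2 = (c2 - c12) / (n - c1)    # P(W2 | not W1) under Hypothesis 2
    log_lambda = (_log_binom(c12, c1, p) + _log_binom(c2 - c12, n - c1, p)
                  - _log_binom(c12, c1, p1) - _log_binom(c2 - c12, n - c1, p2))
    return -2.0 * log_lambda


def chi_square(c1, c2, c12, n):
    """Pearson's chi-square over the 2x2 table laid out as in the text."""
    observed = [[c12, c2 - c12],                    # W1 - W2, not W1 - W2
                [c1 - c12, n - c1 - c2 + c12]]      # W1 - not W2, not W1 - not W2
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def mutual_information(c1, c2, c12, n):
    """Pointwise MI: log2 of P(W1 W2) / (P(W1) * P(W2)), base 2 assumed."""
    return math.log2(c12 * n / (c1 * c2))


if __name__ == "__main__":
    # Invented counts: W1 occurs 100 times, W2 200 times, the bigram 50 times,
    # in a corpus of 10,000 words.
    c1, c2, c12, n = 100, 200, 50, 10000
    print("log likelihood (-2 log lambda):", log_likelihood(c1, c2, c12, n))
    print("chi-square:", chi_square(c1, c2, c12, n))
    print("mutual information:", mutual_information(c1, c2, c12, n))
```

<P>Calling mutual_information(10, 10, 10, 10000) and mutual_information(100, 100, 100, 10000) shows the rare-event effect noted above: both pairs always occur together, yet the rarer pair gets the higher score.</P>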