ノート/テキストマイニング/NLTK+StanfordParser
http://pepper.is.sci.toho-u.ac.jp/pepper/index.php?%A5%CE%A1%BC%A5%C8%2F%A5%C6%A5%AD%A5%B9%A5%C8%A5%DE%A5%A4%A5%CB%A5%F3%A5%B0%2FNLTK%2BStanfordParser
Last updated 2017-07-27 (Thu) 09:06:37
<Completely revised on 2017-07-25, because the Stanford package in NLTK now provides a wrapper>
> ノート/テキストマイニング/NLTK
> ノート/テキストマイニング/Stanfordパーザー
> Completely revised ⇒ ノート/テキストマイニング/NLTK+StanfordParser-2
The explanation has moved to the new page ノート/テキストマイニング/NLTK+StanfordParser-2.
The changes are as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import nltk
from nltk import *
import re
from subprocess import *

f = open('longinput.txt')
file_content = f.read()

# Split file_content into sentences. Parsing one sentence at a time is slow,
# so collect sentences into blocks of about 100 words (the threshold used below is 60).
# blocks is the list of finished blocks; initialize it to empty.
blocks = []
# block accumulates the sentences of the current block
# (a block is a single string made of several sentences); initialize it to empty.
block = ''
# Prepare NLTK's PUNKT tokenizer for sentence splitting
mytokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # Need to prepare this file.
# Split into sentences with the PUNKT tokenizer and store them in the list sentences
sentences = mytokenizer.tokenize(file_content)
# num_words counts the words placed in the current block
num_words = 0
# Repeat the following for each element of the list sentences
for sent in sentences:
    # Split sent into words and store them in the list words
    words = nltk.word_tokenize(sent)
    # Count the elements of words (= number of words in the sentence)
    n = len(words)
    # Add the sentence's word count n to the block's word count num_words
    num_words = num_words + n
    # Append the sentence just examined, sent, to the block being built
    block = block + sent + ' '
    print num_words
    print block
    # If the block's word count num_words has reached the threshold (60 here),
    if num_words >= 60:
        # add the current block to the list blocks and start a new empty block
        blocks.append(block[:])
        num_words = 0
        block = ''
blocks.append(block[:])

# Initialization for launching the Java program "StanfordFromNltk.class" from within NLTK
nltk.internals.config_java()
# Prepare an empty list to accumulate the final result, relation.
relation = []
for block in blocks:
    print "================="
    print block
    print "================="
    p = nltk.internals.java(['StanfordFromNltk'],
                            '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
                            stdin=PIPE, stdout=PIPE, blocking=False)
    q = p.communicate(input=block)
    s = q[0]
    # Split the file into (A)Tree and (B)Relations
    ## Try to find an empty line.
    t = s.split('\n\n', 2)
    #
    for u in t[1].split('\n',2000):   # split the data into lines at newline characters, then
        mref = re.compile(r"((\w|[-])+)\(((\w|[-])+), ((\w|[-])+)\).*", re.S)   # S = DOTALL
        m = mref.search(u)   # use a regular expression to break abc_d(efg-h, ijk-l) into three parts
        if m:
            relation.append([m.group(1), m.group(3), m.group(5)])
    # End of the fixed part of (B)
for v in relation:   # from here on, the contents of the relation list may be used
    print 'relation: ' + v[0] + ' head: ' + v[1] + ' tail: ' + v[2]
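The block-building step can be tried in isolation. The following is a minimal Python 3 sketch, not the author's code: `split_sentences` is a naive regex stand-in for the PUNKT tokenizer, words are counted by whitespace instead of `nltk.word_tokenize`, and the names `split_sentences`, `make_blocks` and `min_words` are hypothetical. Unlike the program above, it does not append an empty trailing block.

```python
import re

def split_sentences(text):
    """Naive stand-in for the PUNKT tokenizer (assumption): split after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def make_blocks(sentences, min_words=60):
    """Concatenate sentences until a block holds at least min_words words."""
    blocks = []
    block = []
    num_words = 0
    for sent in sentences:
        block.append(sent)
        num_words += len(sent.split())   # whitespace word count, not nltk.word_tokenize
        if num_words >= min_words:
            blocks.append(" ".join(block))
            block, num_words = [], 0
    if block:   # keep the trailing partial block, skip an empty one
        blocks.append(" ".join(block))
    return blocks

text = "One two three. Four five six seven. Eight nine. Ten."
blocks = make_blocks(split_sentences(text), min_words=5)
for b in blocks:
    print(b)
```

A small `min_words` is used here only so that the toy input produces more than one block.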
Input:
Epidemiological studies have suggested that the long-term use of aspirin is associated with a decreased incidence of human malignancies, especially colorectal cancer. Since accumulating evidence indicates that peroxynitrite is critically involved in multistage carcinogenesis, this study was undertaken to investigate the ability of aspirin to inhibit peroxynitrite-mediated DNA damage. Peroxynitrite and its generator 3-morpholinosydnonimine (SIN-1) were used to cause DNA strand breaks in phiX-174 plasmid DNA. We demonstrated that the presence of aspirin at concentrations (0.25-2mM) compatible with amounts in plasma during chronic anti-inflammatory therapy resulted in a significant inhibition of DNA cleavage induced by both peroxynitrite and SIN-1. Moreover, the consumption of oxygen caused by 250muM SIN-1 was found to be decreased in the presence of aspirin, indicating that aspirin might affect the auto-oxidation of SIN-1. Furthermore, EPR spectroscopy using 5,5-dimethylpyrroline-N-oxide (DMPO) as a spin trap demonstrated the formation of DMPO-hydroxyl radical adduct (DMPO-OH) from authentic peroxynitrite, and that aspirin at 0.25-2mM potently diminished the radical adduct formation in a concentration-dependent manner. Taken together, these results demonstrate for the first time that aspirin at pharmacologically relevant concentrations can inhibit peroxynitrite-mediated DNA strand breakage and hydroxyl radical formation. These results may have implications for cancer intervention by aspirin.
Running the program on this input produces the following output.
relation: amod head: studies-2 tail: Epidemiological-1
relation: nsubjpass head: used-63 tail: studies-2
relation: aux head: suggested-4 tail: have-3
relation: rcmod head: studies-2 tail: suggested-4
relation: dep head: associated-12 tail: that-5
relation: det head: use-8 tail: the-6
relation: amod head: use-8 tail: long-term-7
relation: nsubjpass head: associated-12 tail: use-8
relation: prep_of head: use-8 tail: aspirin-10
relation: auxpass head: associated-12 tail: is-11
relation: dep head: suggested-4 tail: associated-12
relation: det head: incidence-16 tail: a-14
relation: amod head: incidence-16 tail: decreased-15
relation: prep_with head: associated-12 tail: incidence-16
relation: amod head: malignancies-19 tail: human-18
relation: prep_of head: incidence-16 tail: malignancies-19
relation: dep head: cancer-23 tail: especially-21
relation: amod head: cancer-23 tail: colorectal-22
relation: dep head: undertaken-41 tail: cancer-23
relation: mark head: indicates-28 tail: Since-25
relation: csubj head: indicates-28 tail: accumulating-26
relation: dobj head: accumulating-26 tail: evidence-27
relation: advcl head: undertaken-41 tail: indicates-28
relation: complm head: involved-33 tail: that-29
relation: nsubjpass head: involved-33 tail: peroxynitrite-30
relation: auxpass head: involved-33 tail: is-31
relation: advmod head: involved-33 tail: critically-32
relation: ccomp head: indicates-28 tail: involved-33
relation: amod head: carcinogenesis-36 tail: multistage-35
relation: prep_in head: involved-33 tail: carcinogenesis-36
relation: det head: study-39 tail: this-38
relation: nsubjpass head: undertaken-41 tail: study-39
relation: auxpass head: undertaken-41 tail: was-40
relation: dep head: associated-12 tail: undertaken-41
relation: aux head: investigate-43 tail: to-42
relation: xcomp head: undertaken-41 tail: investigate-43
relation: det head: ability-45 tail: the-44
relation: dobj head: investigate-43 tail: ability-45
relation: prep_of head: ability-45 tail: aspirin-47
relation: aux head: inhibit-49 tail: to-48
relation: xcomp head: investigate-43 tail: inhibit-49
relation: amod head: damage-52 tail: peroxynitrite-mediated-50
relation: nn head: damage-52 tail: DNA-51
relation: dobj head: inhibit-49 tail: damage-52
relation: dep head: studies-2 tail: Peroxynitrite-54
relation: poss head: 3-morpholinosydnonimine-58 tail: its-56
relation: nn head: 3-morpholinosydnonimine-58 tail: generator-57
relation: conj_and head: Peroxynitrite-54 tail: 3-morpholinosydnonimine-58
relation: appos head: studies-2 tail: SIN-1-60
relation: auxpass head: used-63 tail: were-62
relation: aux head: cause-65 tail: to-64
relation: xcomp head: used-63 tail: cause-65
relation: nn head: breaks-68 tail: DNA-66
relation: amod head: breaks-68 tail: strand-67
relation: dobj head: cause-65 tail: breaks-68
relation: amod head: DNA-72 tail: phiX-174-70
relation: amod head: DNA-72 tail: plasmid-71
relation: prep_in head: breaks-68 tail: DNA-72
relation: nsubj head: demonstrated-2 tail: We-1
relation: complm head: resulted-22 tail: that-3
relation: det head: presence-5 tail: the-4
relation: nsubj head: resulted-22 tail: presence-5
relation: prep_of head: presence-5 tail: aspirin-7
relation: prep_at head: aspirin-7 tail: concentrations-9
relation: amod head: concentrations-9 tail: compatible-13
relation: prep_with head: compatible-13 tail: amounts-15
relation: prep_in head: amounts-15 tail: plasma-17
relation: amod head: therapy-21 tail: chronic-19
relation: amod head: therapy-21 tail: anti-inflammatory-20
relation: prep_during head: plasma-17 tail: therapy-21
relation: ccomp head: demonstrated-2 tail: resulted-22
relation: det head: inhibition-26 tail: a-24
relation: amod head: inhibition-26 tail: significant-25
relation: prep_in head: resulted-22 tail: inhibition-26
relation: nn head: cleavage-29 tail: DNA-28
relation: prep_of head: inhibition-26 tail: cleavage-29
relation: partmod head: cleavage-29 tail: induced-30
relation: det head: peroxynitrite-33 tail: both-32
relation: agent head: induced-30 tail: peroxynitrite-33
relation: conj_and head: peroxynitrite-33 tail: SIN-1-35
relation: advmod head: induced-30 tail: Moreover-37
relation: det head: consumption-40 tail: the-39
relation: nsubjpass head: found-48 tail: consumption-40
relation: prep_of head: consumption-40 tail: oxygen-42
relation: partmod head: oxygen-42 tail: caused-43
relation: amod head: SIN-1-46 tail: 250muM-45
relation: agent head: caused-43 tail: SIN-1-46
relation: auxpass head: found-48 tail: was-47
relation: ccomp head: demonstrated-2 tail: found-48
relation: aux head: decreased-51 tail: to-49
relation: auxpass head: decreased-51 tail: be-50
relation: xcomp head: found-48 tail: decreased-51
relation: det head: presence-54 tail: the-53
relation: prep_in head: decreased-51 tail: presence-54
relation: prep_of head: presence-54 tail: aspirin-56
relation: dep head: demonstrated-2 tail: indicating-58
relation: complm head: affect-62 tail: that-59
relation: nsubj head: affect-62 tail: aspirin-60
relation: aux head: affect-62 tail: might-61
relation: ccomp head: indicating-58 tail: affect-62
relation: det head: auto-oxidation-64 tail: the-63
relation: dobj head: affect-62 tail: auto-oxidation-64
relation: prep_of head: auto-oxidation-64 tail: SIN-1-66
relation: advmod head: demonstrate-49 tail: Furthermore-1
relation: nsubj head: spectroscopy-4 tail: EPR-3
relation: parataxis head: demonstrate-49 tail: spectroscopy-4
relation: xcomp head: spectroscopy-4 tail: using-5
relation: mark head: demonstrated-14 tail: as-10
relation: det head: trap-13 tail: a-11
relation: nn head: trap-13 tail: spin-12
relation: nsubj head: demonstrated-14 tail: trap-13
relation: det head: formation-16 tail: the-15
relation: dobj head: demonstrated-14 tail: formation-16
relation: amod head: adduct-20 tail: DMPO-hydroxyl-18
relation: amod head: adduct-20 tail: radical-19
relation: prep_of head: formation-16 tail: adduct-20
relation: abbrev head: adduct-20 tail: DMPO-OH-22
relation: amod head: peroxynitrite-26 tail: authentic-25
relation: prep_from head: demonstrated-14 tail: peroxynitrite-26
relation: dep head: diminished-34 tail: that-29
relation: nsubj head: diminished-34 tail: aspirin-30
relation: prep_at head: aspirin-30 tail: potently-33
relation: conj_and head: demonstrated-14 tail: diminished-34
relation: det head: formation-38 tail: the-35
relation: amod head: formation-38 tail: radical-36
relation: nn head: formation-38 tail: adduct-37
relation: dobj head: diminished-34 tail: formation-38
relation: det head: manner-42 tail: a-40
relation: amod head: manner-42 tail: concentration-dependent-41
relation: prep_in head: formation-38 tail: manner-42
relation: partmod head: formation-38 tail: Taken-44
relation: advmod head: Taken-44 tail: together-45
relation: det head: results-48 tail: these-47
relation: nsubj head: demonstrate-49 tail: results-48
relation: det head: time-53 tail: the-51
relation: amod head: time-53 tail: first-52
relation: prep_for head: demonstrate-49 tail: time-53
relation: complm head: inhibit-61 tail: that-54
relation: nsubj head: inhibit-61 tail: aspirin-55
relation: advmod head: relevant-58 tail: pharmacologically-57
relation: amod head: concentrations-59 tail: relevant-58
relation: prep_at head: aspirin-55 tail: concentrations-59
relation: aux head: inhibit-61 tail: can-60
relation: ccomp head: demonstrate-49 tail: inhibit-61
relation: amod head: breakage-65 tail: peroxynitrite-mediated-62
relation: nn head: breakage-65 tail: DNA-63
relation: nn head: breakage-65 tail: strand-64
relation: dobj head: inhibit-61 tail: breakage-65
relation: nn head: formation-69 tail: hydroxyl-67
relation: amod head: formation-69 tail: radical-68
relation: conj_and head: breakage-65 tail: formation-69
relation: det head: results-2 tail: These-1
relation: nsubj head: have-4 tail: results-2
relation: aux head: have-4 tail: may-3
relation: dobj head: have-4 tail: implications-5
relation: nn head: intervention-8 tail: cancer-7
relation: prep_for head: implications-5 tail: intervention-8
relation: prep_by head: have-4 tail: aspirin-10
The Java program is the same as in version -1.
The Python program is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
##
import sys
import nltk
import re
from subprocess import *

instring = "Lung cancer has become increasingly common in women, and gender differences \
in the physiology and pathogenesis of the disease have suggested a role for estrogens."

# Invoke the Java Program "StanfordFromNltk.class" from nltk
nltk.internals.config_java()
p = nltk.internals.java(['StanfordFromNltk'],
                        '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
                        stdin=PIPE, stdout=PIPE, blocking=False)
q = p.communicate(input=instring)
s = q[0]
##print s

# Split the file into (A)Tree and (B)Relations
## Try to find an empty line.
t = s.split('\n\n', 2)
#print t[0]   # t[0] contains (A)Tree part
#print '---'
#print t[1]   # t[1] contains (B)Relation part

# Put the (A)Tree part to the "bracket_parse" method
tr = nltk.bracket_parse(t[0])
# End of the fixed part of (A)
#print tr   # We got the tree

# Try various Tree methods
# Usage examples for (A) start here
#(1) pick up various nodes
#print tr[0]        # print the 1st node => " subtree "
#print tr[0].node   # print NODE (property) part of the 1st node => "S"
# Note that the top level does not have the 2nd branch
#print tr[0,0]      # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]      # print the 2nd node under the 1st node -> "(, ,)"
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()   # "subtrees" method creates a "generator"
for u in ss:
    print u   # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if u.node == "NP":   # NP is Noun Phrase, i.e., not leaves
        print u
print "----"
#
# Now, interpret the (B) Relation part (second part t[1])
# End of (A); part (B) starts here
#
relation = []   # prepare an empty list
# Fixed part of (B)
for u in t[1].split('\n',2000):   # split the data into lines at newline characters, then
    mref = re.compile(r"((\w|[-])+)\(((\w|[-])+), ((\w|[-])+)\).*", re.S)   # S = DOTALL
    m = mref.search(u)   # use a regular expression to break abc_d(efg-h, ijk-l) into three parts
    if m:
        relation.append([m.group(1), m.group(3), m.group(5)])
# End of the fixed part of (B)
print relation
# Usage examples for (B) start here
print "----"
for v in relation:   # from here on, the contents of the relation list may be used
    print 'relation: ' + v[0] + ' head: ' + v[1] + ' tail: ' + v[2]
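The relation-parsing step (B) can be checked on its own without running the parser. Below is a minimal Python 3 sketch under stated assumptions: the pattern is an equivalent rewrite of the one above using character classes (so the groups are numbered 1 to 3 instead of 1, 3, 5), the sample lines are hypothetical and only imitate the `abc_d(efg-h, ijk-l)` shape described in the comment, and `parse_relations` is an illustrative name.

```python
import re

# Relation name, head token and tail token, e.g. nsubj(common-6, cancer-2).
# \w also matches "_", so names such as prep_in are covered.
DEP_LINE = re.compile(r"([\w-]+)\(([\w-]+), ([\w-]+)\)")

def parse_relations(text):
    """Return [relation, head, tail] for every line that matches the pattern."""
    relations = []
    for line in text.splitlines():
        m = DEP_LINE.search(line)
        if m:   # blank or non-matching lines are skipped silently
            relations.append([m.group(1), m.group(2), m.group(3)])
    return relations

# Hypothetical sample in the shape of the parser's typed-dependency lines
sample = "nn(cancer-2, Lung-1)\nnsubj(common-6, cancer-2)\n\nprep_in(common-6, women-8)\n"
relations = parse_relations(sample)
for rel, head, tail in relations:
    print("relation: " + rel + " head: " + head + " tail: " + tail)
```

Compiling the pattern once outside the loop is a small design choice; tokens containing characters other than letters, digits, `_` and `-` will not match this pattern.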
The output is as follows (the first half is the same as in version -1):
%python readfromstanford.py
[Found java: /usr/java/default/bin/java]
Loading parser from serialized file /usr/local/stanford-parser-2008-10-26/englishPCFG.ser.gz ... done [3.0 sec].
(ROOT (S (S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))) (, ,) (CC and) (S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))) (. .)))
(S (S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))) (, ,) (CC and) (S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))) (. .))
(S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women))))))
(NP (NNP Lung) (NN cancer))
(NNP Lung)
(NN cancer)
(VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))
(VBZ has)
(VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women))))
(VBN become)
(ADJP (RB increasingly) (JJ common))
(RB increasingly)
(JJ common)
(PP (IN in) (NP (NNS women)))
(IN in)
(NP (NNS women))
(NNS women)
(, ,)
(CC and)
(S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
(NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NN gender)
(NNS differences)
(PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))
(IN in)
(NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(DT the)
(NN physiology)
(CC and)
(NN pathogenesis)
(PP (IN of) (NP (DT the) (NN disease)))
(IN of)
(NP (DT the) (NN disease))
(DT the)
(NN disease)
(VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))
(VBP have)
(VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))
(VBN suggested)
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(DT a)
(NN role)
(PP (IN for) (NP (NNS estrogens)))
(IN for)
(NP (NNS estrogens))
(NNS estrogens)
(. .)
----
(NN cancer)
(NNS women)
(NN gender)
(NNS differences)
(NN physiology)
(NN pathogenesis)
(NN disease)
(NN role)
(NNS estrogens)
----
cancer
women
gender
differences
physiology
pathogenesis
disease
role
estrogens
----
(NP (NNP Lung) (NN cancer))
(NP (NNS women))
(NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(NP (DT the) (NN disease))
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(NP (NNS estrogens))
----
[['nn', 'cancer-2', 'Lung-1'], ['nsubj', 'common-6', 'cancer-2'], ['aux', 'common-6', 'has-3'], ['cop', 'common-6', 'become-4'], ['advmod', 'common-6', 'increasingly-5'], ['prep_in', 'common-6', 'women-8'], ['nn', 'differences-12', 'gender-11'], ['nsubj', 'suggested-22', 'differences-12'], ['det', 'physiology-15', 'the-14'], ['prep_in', 'differences-12', 'physiology-15'], ['conj_and', 'physiology-15', 'pathogenesis-17'], ['det', 'disease-20', 'the-19'], ['prep_of', 'physiology-15', 'disease-20'], ['aux', 'suggested-22', 'have-21'], ['conj_and', 'common-6', 'suggested-22'], ['det', 'role-24', 'a-23'], ['dobj', 'suggested-22', 'role-24'], ['prep_for', 'role-24', 'estrogens-26']]
----
relation: nn head: cancer-2 tail: Lung-1
relation: nsubj head: common-6 tail: cancer-2
relation: aux head: common-6 tail: has-3
relation: cop head: common-6 tail: become-4
relation: advmod head: common-6 tail: increasingly-5
relation: prep_in head: common-6 tail: women-8
relation: nn head: differences-12 tail: gender-11
relation: nsubj head: suggested-22 tail: differences-12
relation: det head: physiology-15 tail: the-14
relation: prep_in head: differences-12 tail: physiology-15
relation: conj_and head: physiology-15 tail: pathogenesis-17
relation: det head: disease-20 tail: the-19
relation: prep_of head: physiology-15 tail: disease-20
relation: aux head: suggested-22 tail: have-21
relation: conj_and head: common-6 tail: suggested-22
relation: det head: role-24 tail: a-23
relation: dobj head: suggested-22 tail: role-24
relation: prep_for head: role-24 tail: estrogens-26
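Each head and tail in the output above carries the word's position in the parsed text as a `-N` suffix. As one possible way of using the relation list, here is a small Python 3 sketch that separates the word from its position and groups dependents under their head. The helper name `split_token` and the grouping are illustrative assumptions, not part of the original program; the triples are copied from the output above.

```python
from collections import defaultdict

def split_token(token):
    """Split 'word-N' at the last hyphen, so hyphenated words stay intact."""
    word, _, index = token.rpartition("-")
    return word, int(index)

# A few triples taken from the relation list printed above
relation = [
    ["nn", "cancer-2", "Lung-1"],
    ["nsubj", "common-6", "cancer-2"],
    ["aux", "common-6", "has-3"],
    ["prep_in", "common-6", "women-8"],
]

# Map (head word, head position) -> list of (relation name, dependent word)
dependents = defaultdict(list)
for rel, head, tail in relation:
    tail_word, _tail_pos = split_token(tail)
    dependents[split_token(head)].append((rel, tail_word))

for (word, pos), deps in dependents.items():
    print(word, pos, deps)
```

Keying on the (word, position) pair rather than the word alone keeps repeated words, such as the two occurrences of "the", apart.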
Prepare in advance the Java program (class file) that calls the Stanford parser. The source is:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class StanfordFromNltk{
  public static void main(String[] args) {
    LexicalizedParser lp = new LexicalizedParser("/usr/local/stanford-parser-2008-10-26/englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    String sent = "";
    try{
      BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
      // String sent = "This is an easy sentence.";
      sent = br.readLine();
      br.close();
    } catch(IOException e){
      System.out.println("Input Error");
    }
    Tree parse = (Tree) lp.apply(sent);
    // parse.pennPrint();
    // System.out.println();
    // TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    // GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    // Collection tdl = gs.typedDependenciesCollapsed();
    // System.out.println(tdl);
    // System.out.println();
    //
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }
}
Compile this with javac to produce the file StanfordFromNltk.class.
Next, the Python program that calls it is as follows.
#!/usr/bin/env python
# encoding: utf-8
import sys
import nltk
from subprocess import *   # this appears to be needed

instr = "Lung cancer has become increasingly common in women, and gender differences \
in the physiology and pathogenesis of the disease have suggested a role for estrogens."

# Invoke the Java Program "StanfordFromNltk.class" from nltk
nltk.internals.config_java()
p = nltk.internals.java(['StanfordFromNltk'],
                        '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
                        stdin=PIPE, stdout=PIPE, blocking=False)
# # Invoke the Java program. stdin=PIPE reads from a pipe; stdout=PIPE writes to a pipe.
# # blocking=False allows an input PIPE (the two processes run concurrently)
q = p.communicate(input=instr)   # run java; input is piped from instr, the return value is (stdout, stderr)
s = q[0]   # element 0 of the tuple (stdout, stderr), i.e. stdout
print s
#
#============================================
#
# From here on, the same as Sample version -2
# Split the file into (A)Tree and (B)Relations
## Try to find an empty line.
t = s.split('\n\n', 2)   # split the input captured in s (= the Stanford parser's output)
#print t[0]   # t[0] contains (A)Tree part
#print '---'
#print t[1]   # t[1] contains (B)Relation part

# Input the (A)Tree part to the "bracket_parse" method
tr = nltk.bracket_parse(t[0])
#print tr   # We got the tree

# Try various Tree methods
#(1) pick up various nodes
#print tr[0]        # print the 1st node => " subtree "
#print tr[0].node   # print NODE (property) part of the 1st node => "S"
# Note that the top level does not have the 2nd branch
#print tr[0,0]      # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]      # print the 2nd node under the 1st node -> "(, ,)"
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()   # "subtrees" method creates a "generator"
for u in ss:
    print u   # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if u.node == "NP":   # NP is Noun Phrase, i.e., not leaves
        print u
print "----"
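The pipe mechanics used above (write the sentence to the child's stdin, read its stdout through `communicate`) can be tried without Java or NLTK. The following Python 3 sketch uses the standard `subprocess` module directly; the child process is a hypothetical stand-in that upper-cases one input line, where the real setup would run the `StanfordFromNltk` class.

```python
import subprocess
import sys

# Hypothetical child program standing in for the Java class:
# it reads one line from stdin and prints it in upper case.
child_code = "import sys; print(sys.stdin.readline().strip().upper())"

p = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,    # the parent writes the sentence into this pipe
    stdout=subprocess.PIPE,   # the parent reads the child's output from this pipe
    text=True,                # exchange str instead of bytes (Python 3.7+)
)
# communicate() sends the input, closes stdin, waits for the child to exit,
# and returns the tuple (stdout, stderr)
stdout_data, stderr_data = p.communicate(input="Lung cancer has become increasingly common.\n")
s = stdout_data
print(s)
```

Because stderr is not redirected here, the second element of the returned tuple is `None`, and anything the child writes to stderr (such as the "Loading parser" message in the Java case) goes straight to the terminal.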
This version still skips both calling the Stanford parser and reading the relation part. Instead of calling the parser, it reads parser output that has been saved to a file.
#!/usr/bin/env python
# encoding: utf-8
import sys
import nltk

f = open('ParserDemoMore.out')   # this file holds the parser's output
s = f.read()

# Split the file into (A)Tree and (B)Relations; split the parser output into the tree and the relations
## Try to find an empty line.
t = s.split('\n\n', 2)
#print t[0]   # t[0] contains (A)Tree part
#print '---'
#print t[1]   # t[1] contains (B)Relation part

# Input the (A)Tree part to the "bracket_parse" method; load the tree part into Python
tr = nltk.bracket_parse(t[0])
#print tr   # We got the tree

# Try various Tree methods
#(1) pick up various nodes
#print tr[0]        # print the 1st node => " subtree "
#print tr[0].node   # print NODE (property) part of the 1st node => "S"
# Note that the top level does not have the 2nd branch
#print tr[0,0]      # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]      # print the 2nd node under the 1st node -> "(, ,)"
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()   # "subtrees" method creates a "generator"
for u in ss:
    print u   # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if (u.node == "NN") or (u.node == "NNS"):   # Recall NN/NNS are used only for leaves
        print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():   # Need to invoke "subtrees" every time because it's a generator
    if u.node == "NP":   # NP is Noun Phrase, i.e., not leaves
        print u
print "----"
Its output is:
(ROOT (S (S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))) (, ,) (CC and) (S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))) (. .)))
(S (S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))) (, ,) (CC and) (S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))) (. .))
(S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women))))))
(NP (NNP Lung) (NN cancer))
(NNP Lung)
(NN cancer)
(VP (VBZ has) (VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women)))))
(VBZ has)
(VP (VBN become) (ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women))))
(VBN become)
(ADJP (RB increasingly) (JJ common))
(RB increasingly)
(JJ common)
(PP (IN in) (NP (NNS women)))
(IN in)
(NP (NNS women))
(NNS women)
(, ,)
(CC and)
(S (NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))) (VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
(NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NN gender)
(NNS differences)
(PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease)))))
(IN in)
(NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(DT the)
(NN physiology)
(CC and)
(NN pathogenesis)
(PP (IN of) (NP (DT the) (NN disease)))
(IN of)
(NP (DT the) (NN disease))
(DT the)
(NN disease)
(VP (VBP have) (VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))
(VBP have)
(VP (VBN suggested) (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))
(VBN suggested)
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(DT a)
(NN role)
(PP (IN for) (NP (NNS estrogens)))
(IN for)
(NP (NNS estrogens))
(NNS estrogens)
(. .)
----
(NN cancer)
(NNS women)
(NN gender)
(NNS differences)
(NN physiology)
(NN pathogenesis)
(NN disease)
(NN role)
(NNS estrogens)
----
cancer
women
gender
differences
physiology
pathogenesis
disease
role
estrogens
----
(NP (NNP Lung) (NN cancer))
(NP (NNS women))
(NP (NP (NN gender) (NNS differences)) (PP (IN in) (NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NP (NP (DT the) (NN physiology) (CC and) (NN pathogenesis)) (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(NP (DT the) (NN disease))
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(NP (NNS estrogens))
----
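The programs on this page rely on `nltk.bracket_parse` and the `.node` attribute from the NLTK release in use at the time; in NLTK 3 the corresponding calls are, as far as I know, `nltk.Tree.fromstring` and `.label()`. To show what the tree-walking step does without depending on any NLTK version, here is a self-contained Python 3 sketch. The functions `read_tree` and `subtrees` are hypothetical stand-ins, and the input is a shortened version of the first clause of the tree above.

```python
import re

def read_tree(text):
    """Parse a Penn-style bracketed string into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                       # consume "("
        node = [tokens[pos]]           # first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())   # nested subtree
            else:
                node.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                       # consume ")"
        return node

    return parse()

def subtrees(node):
    """Yield every subtree in pre-order, the same order as the listing above."""
    yield node
    for child in node[1:]:
        if isinstance(child, list):
            yield from subtrees(child)

# Shortened sample tree (assumption: only the first clause is kept)
tree = read_tree(
    "(ROOT (S (NP (NNP Lung) (NN cancer)) (VP (VBZ has) (VP (VBN become) "
    "(ADJP (RB increasingly) (JJ common)) (PP (IN in) (NP (NNS women))))) (. .)))"
)
# Select the word under every NN or NNS node, as in example (2-2-2)
nouns = [t[1] for t in subtrees(tree) if t[0] in ("NN", "NNS")]
print(nouns)
```

Nested lists are used instead of a tree class only to keep the sketch short; the selection by label mirrors the `u.node == "NN"` test in the programs above.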