Notes/Text Mining (ノート/テキストマイニング)
Last updated: 2010-01-11 (Mon) 17:52:19
> ノート/テキストマイニング/NLTK
> ノート/テキストマイニング/Stanfordパーザー
> Completely revised ⇒ ノート/テキストマイニング/NLTK+StanfordParser-2

The structure was changed considerably in the move to version 1 (2010/01/11).

The explanation continues on the new page ノート/テキストマイニング/NLTK+StanfordParser-2.
The changes are:

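Sample: improved version 0 = version 0 with a sentence-splitting step placed in front, so that long text made up of several sentences can be given as input, plus input from a file (2010/01/06)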

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import sys
from subprocess import PIPE

import nltk

# Read the whole input text from a file
f = open('longinput.txt')
file_content = f.read()
f.close()

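# Split file_content into sentences. Parsing sentence by sentence is slow, so collect
# sentences into blocks of roughly 100 words each (the loop below uses a threshold of 60).
# blocks is the list of finished blocks that is built up; start with it empty.
blocks = []
# block is the string in which the sentences of the current block accumulate
# (one block = several sentences joined into a single string); start with it empty.
block = ''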

# Load the Punkt sentence tokenizer bundled with NLTK
mytokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # This data file must be installed beforehand.

# Split the text into sentences with the Punkt tokenizer and store them in the list sentences
sentences = mytokenizer.tokenize(file_content)

# num_words counts the words accumulated in the current block
num_words = 0

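# Repeat the following for each sentence in sentences
for sent in sentences:
  # Split sent into words and store them in the list words
  words = nltk.word_tokenize(sent)
  # Add the number of words in this sentence to the block's word count
  num_words = num_words + len(words)
  # Append the sentence to the block under construction. Line breaks inside the
  # sentence are replaced by spaces because the Java program reads only one line.
  block = block + sent.replace('\n', ' ') + '  '
  print num_words
  print block
  # Once the block holds 60 or more words,
  if num_words >= 60:
    # add it to the list blocks and start a new, empty block
    blocks.append(block)
    num_words = 0
    block = ''

# Add the last, partially filled block unless it is empty
if block:
  blocks.append(block)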

# Set up NLTK so that it can launch the Java program "StanfordFromNltk.class"
nltk.internals.config_java()

# Empty list in which the final result, relation, is accumulated.
relation = []

# Pattern that breaks a line such as abc_d(efg-h, ijk-l) into its three parts.
# A relation whose head or tail contains characters other than word characters
# and "-" (such as "." or ",") does not match and is skipped.
mref = re.compile(r"((\w|[-])+)\(((\w|[-])+), ((\w|[-])+)\).*", re.S)  # re.S = DOTALL

for block in blocks:
  print "================="
  print block
  print "================="
  p = nltk.internals.java(['StanfordFromNltk'],
                          '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
                          stdin=PIPE, stdout=PIPE, blocking=False)

  q = p.communicate(input=block)
  s = q[0]

  # Split the output into (A) the tree and (B) the relations,
  # which are separated by an empty line.
  t = s.split('\n\n', 2)
  if len(t) < 2:
    continue        # no relation part in the output

  for u in t[1].split('\n'):      # go through the relation part line by line
    m = mref.search(u)
    if m:
      relation.append([m.group(1), m.group(3), m.group(5)])  # end of the fixed part of (B)

for v in relation:         # from here on, the contents of relation can be used freely
  print 'relation: ' + v[0] + '  head: ' + v[1] + '  tail: ' + v[2]
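
The block-building step in the program above can be tried on its own, without NLTK. The following is only a minimal sketch under stated assumptions, not the code used on this page: the sentences are assumed to be split already, words are counted with a plain str.split() instead of nltk.word_tokenize, and the names build_blocks and min_words are made up for illustration.

```python
def build_blocks(sentences, min_words=60):
    """Group sentences into blocks that each hold at least min_words words.

    The last block may be shorter because it only collects the leftovers.
    """
    blocks = []
    current = []        # sentences of the block under construction
    num_words = 0       # words collected in the current block so far
    for sent in sentences:
        current.append(sent)
        num_words += len(sent.split())   # crude word count (assumption)
        if num_words >= min_words:
            blocks.append(' '.join(current))
            current = []
            num_words = 0
    if current:         # keep the partially filled last block
        blocks.append(' '.join(current))
    return blocks


if __name__ == '__main__':
    demo = ['one two three.', 'four five.', 'six seven eight nine.', 'ten.']
    for b in build_blocks(demo, min_words=4):
        print(b)
```

A block is closed only after the sentence that crosses the threshold has been added, so a block can be longer than min_words by up to one sentence.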

The input is:
Epidemiological studies have suggested that the long-term use of aspirin is associated with a decreased incidence of human malignancies, especially colorectal cancer. Since accumulating evidence indicates that peroxynitrite is critically involved in multistage carcinogenesis, this study was undertaken to investigate the ability of aspirin to inhibit peroxynitrite-mediated DNA damage. Peroxynitrite and its generator 3-morpholinosydnonimine (SIN-1) were used to cause DNA strand breaks in phiX-174 plasmid DNA. We demonstrated that the presence of aspirin at concentrations (0.25-2mM) compatible with amounts in plasma during chronic anti-inflammatory therapy resulted in a significant inhibition of DNA cleavage induced by both peroxynitrite and SIN-1. Moreover, the consumption of oxygen caused by 250muM SIN-1 was found to be decreased in the presence of aspirin, indicating that aspirin might affect the auto-oxidation of SIN-1. Furthermore, EPR spectroscopy using 5,5-dimethylpyrroline-N-oxide (DMPO) as a spin trap demonstrated the formation of DMPO-hydroxyl radical adduct (DMPO-OH) from authentic peroxynitrite, and that aspirin at 0.25-2mM potently diminished the radical adduct formation in a concentration-dependent manner. Taken together, these results demonstrate for the first time that aspirin at pharmacologically relevant concentrations can inhibit peroxynitrite-mediated DNA strand breakage and hydroxyl radical formation. These results may have implications for cancer intervention by aspirin.
Given this input, the execution result is as follows.

relation: amod  head: studies-2  tail: Epidemiological-1
relation: nsubjpass  head: used-63  tail: studies-2
relation: aux  head: suggested-4  tail: have-3
relation: rcmod  head: studies-2  tail: suggested-4
relation: dep  head: associated-12  tail: that-5
relation: det  head: use-8  tail: the-6
relation: amod  head: use-8  tail: long-term-7
relation: nsubjpass  head: associated-12  tail: use-8
relation: prep_of  head: use-8  tail: aspirin-10
relation: auxpass  head: associated-12  tail: is-11
relation: dep  head: suggested-4  tail: associated-12
relation: det  head: incidence-16  tail: a-14
relation: amod  head: incidence-16  tail: decreased-15
relation: prep_with  head: associated-12  tail: incidence-16
relation: amod  head: malignancies-19  tail: human-18
relation: prep_of  head: incidence-16  tail: malignancies-19
relation: dep  head: cancer-23  tail: especially-21
relation: amod  head: cancer-23  tail: colorectal-22
relation: dep  head: undertaken-41  tail: cancer-23
relation: mark  head: indicates-28  tail: Since-25
relation: csubj  head: indicates-28  tail: accumulating-26
relation: dobj  head: accumulating-26  tail: evidence-27
relation: advcl  head: undertaken-41  tail: indicates-28
relation: complm  head: involved-33  tail: that-29
relation: nsubjpass  head: involved-33  tail: peroxynitrite-30
relation: auxpass  head: involved-33  tail: is-31
relation: advmod  head: involved-33  tail: critically-32
relation: ccomp  head: indicates-28  tail: involved-33
relation: amod  head: carcinogenesis-36  tail: multistage-35
relation: prep_in  head: involved-33  tail: carcinogenesis-36
relation: det  head: study-39  tail: this-38
relation: nsubjpass  head: undertaken-41  tail: study-39
relation: auxpass  head: undertaken-41  tail: was-40
relation: dep  head: associated-12  tail: undertaken-41
relation: aux  head: investigate-43  tail: to-42
relation: xcomp  head: undertaken-41  tail: investigate-43
relation: det  head: ability-45  tail: the-44
relation: dobj  head: investigate-43  tail: ability-45
relation: prep_of  head: ability-45  tail: aspirin-47
relation: aux  head: inhibit-49  tail: to-48
relation: xcomp  head: investigate-43  tail: inhibit-49
relation: amod  head: damage-52  tail: peroxynitrite-mediated-50
relation: nn  head: damage-52  tail: DNA-51
relation: dobj  head: inhibit-49  tail: damage-52
relation: dep  head: studies-2  tail: Peroxynitrite-54
relation: poss  head: 3-morpholinosydnonimine-58  tail: its-56
relation: nn  head: 3-morpholinosydnonimine-58  tail: generator-57
relation: conj_and  head: Peroxynitrite-54  tail: 3-morpholinosydnonimine-58
relation: appos  head: studies-2  tail: SIN-1-60
relation: auxpass  head: used-63  tail: were-62
relation: aux  head: cause-65  tail: to-64
relation: xcomp  head: used-63  tail: cause-65
relation: nn  head: breaks-68  tail: DNA-66
relation: amod  head: breaks-68  tail: strand-67
relation: dobj  head: cause-65  tail: breaks-68
relation: amod  head: DNA-72  tail: phiX-174-70
relation: amod  head: DNA-72  tail: plasmid-71
relation: prep_in  head: breaks-68  tail: DNA-72
relation: nsubj  head: demonstrated-2  tail: We-1
relation: complm  head: resulted-22  tail: that-3
relation: det  head: presence-5  tail: the-4
relation: nsubj  head: resulted-22  tail: presence-5
relation: prep_of  head: presence-5  tail: aspirin-7
relation: prep_at  head: aspirin-7  tail: concentrations-9
relation: amod  head: concentrations-9  tail: compatible-13
relation: prep_with  head: compatible-13  tail: amounts-15
relation: prep_in  head: amounts-15  tail: plasma-17
relation: amod  head: therapy-21  tail: chronic-19
relation: amod  head: therapy-21  tail: anti-inflammatory-20
relation: prep_during  head: plasma-17  tail: therapy-21
relation: ccomp  head: demonstrated-2  tail: resulted-22
relation: det  head: inhibition-26  tail: a-24
relation: amod  head: inhibition-26  tail: significant-25
relation: prep_in  head: resulted-22  tail: inhibition-26
relation: nn  head: cleavage-29  tail: DNA-28
relation: prep_of  head: inhibition-26  tail: cleavage-29
relation: partmod  head: cleavage-29  tail: induced-30
relation: det  head: peroxynitrite-33  tail: both-32
relation: agent  head: induced-30  tail: peroxynitrite-33
relation: conj_and  head: peroxynitrite-33  tail: SIN-1-35
relation: advmod  head: induced-30  tail: Moreover-37
relation: det  head: consumption-40  tail: the-39
relation: nsubjpass  head: found-48  tail: consumption-40
relation: prep_of  head: consumption-40  tail: oxygen-42
relation: partmod  head: oxygen-42  tail: caused-43
relation: amod  head: SIN-1-46  tail: 250muM-45
relation: agent  head: caused-43  tail: SIN-1-46
relation: auxpass  head: found-48  tail: was-47
relation: ccomp  head: demonstrated-2  tail: found-48
relation: aux  head: decreased-51  tail: to-49
relation: auxpass  head: decreased-51  tail: be-50
relation: xcomp  head: found-48  tail: decreased-51
relation: det  head: presence-54  tail: the-53
relation: prep_in  head: decreased-51  tail: presence-54
relation: prep_of  head: presence-54  tail: aspirin-56
relation: dep  head: demonstrated-2  tail: indicating-58
relation: complm  head: affect-62  tail: that-59
relation: nsubj  head: affect-62  tail: aspirin-60
relation: aux  head: affect-62  tail: might-61
relation: ccomp  head: indicating-58  tail: affect-62
relation: det  head: auto-oxidation-64  tail: the-63
relation: dobj  head: affect-62  tail: auto-oxidation-64
relation: prep_of  head: auto-oxidation-64  tail: SIN-1-66
relation: advmod  head: demonstrate-49  tail: Furthermore-1
relation: nsubj  head: spectroscopy-4  tail: EPR-3
relation: parataxis  head: demonstrate-49  tail: spectroscopy-4
relation: xcomp  head: spectroscopy-4  tail: using-5
relation: mark  head: demonstrated-14  tail: as-10
relation: det  head: trap-13  tail: a-11
relation: nn  head: trap-13  tail: spin-12
relation: nsubj  head: demonstrated-14  tail: trap-13
relation: det  head: formation-16  tail: the-15
relation: dobj  head: demonstrated-14  tail: formation-16
relation: amod  head: adduct-20  tail: DMPO-hydroxyl-18
relation: amod  head: adduct-20  tail: radical-19
relation: prep_of  head: formation-16  tail: adduct-20
relation: abbrev  head: adduct-20  tail: DMPO-OH-22
relation: amod  head: peroxynitrite-26  tail: authentic-25
relation: prep_from  head: demonstrated-14  tail: peroxynitrite-26
relation: dep  head: diminished-34  tail: that-29
relation: nsubj  head: diminished-34  tail: aspirin-30
relation: prep_at  head: aspirin-30  tail: potently-33
relation: conj_and  head: demonstrated-14  tail: diminished-34
relation: det  head: formation-38  tail: the-35
relation: amod  head: formation-38  tail: radical-36
relation: nn  head: formation-38  tail: adduct-37
relation: dobj  head: diminished-34  tail: formation-38
relation: det  head: manner-42  tail: a-40
relation: amod  head: manner-42  tail: concentration-dependent-41
relation: prep_in  head: formation-38  tail: manner-42
relation: partmod  head: formation-38  tail: Taken-44
relation: advmod  head: Taken-44  tail: together-45
relation: det  head: results-48  tail: these-47
relation: nsubj  head: demonstrate-49  tail: results-48
relation: det  head: time-53  tail: the-51
relation: amod  head: time-53  tail: first-52
relation: prep_for  head: demonstrate-49  tail: time-53
relation: complm  head: inhibit-61  tail: that-54
relation: nsubj  head: inhibit-61  tail: aspirin-55
relation: advmod  head: relevant-58  tail: pharmacologically-57
relation: amod  head: concentrations-59  tail: relevant-58
relation: prep_at  head: aspirin-55  tail: concentrations-59
relation: aux  head: inhibit-61  tail: can-60
relation: ccomp  head: demonstrate-49  tail: inhibit-61
relation: amod  head: breakage-65  tail: peroxynitrite-mediated-62
relation: nn  head: breakage-65  tail: DNA-63
relation: nn  head: breakage-65  tail: strand-64
relation: dobj  head: inhibit-61  tail: breakage-65
relation: nn  head: formation-69  tail: hydroxyl-67
relation: amod  head: formation-69  tail: radical-68
relation: conj_and  head: breakage-65  tail: formation-69
relation: det  head: results-2  tail: These-1
relation: nsubj  head: have-4  tail: results-2
relation: aux  head: have-4  tail: may-3
relation: dobj  head: have-4  tail: implications-5
relation: nn  head: intervention-8  tail: cancer-7
relation: prep_for  head: implications-5  tail: intervention-8
relation: prep_by  head: have-4  tail: aspirin-10
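
Every head and tail in this output carries the position of the token as a suffix (studies-2, long-term-7). A small helper can separate the word from the position; the sketch below is an illustration only, and the name split_token is hypothetical. It cuts at the last hyphen because the words themselves may contain hyphens (SIN-1-60).

```python
def split_token(token):
    """Split a token such as 'long-term-7' into ('long-term', 7)."""
    # rpartition cuts at the LAST hyphen, so hyphens inside the word survive
    word, _, index = token.rpartition('-')
    return word, int(index)


for token in ['studies-2', 'long-term-7', 'SIN-1-60']:
    print(split_token(token))
```

In the listing above the positions start again from 1 with each block (We-1 follows DNA-72), so a (word, position) pair identifies a token only within its block.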

Sample: version 0 = version −1 plus the Stanford relation table (2009/12/25)

The Java program is the same as in version −1.

The Python program is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import sys
from subprocess import PIPE

import nltk

instring = "Lung cancer has become increasingly common in women, and gender differences  \
in the physiology and pathogenesis of the disease have suggested a role for  estrogens."

# Invoke the Java Program "StanfordFromNltk.class" from nltk
nltk.internals.config_java()
p = nltk.internals.java(['StanfordFromNltk'], '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
stdin=PIPE, stdout=PIPE, blocking=False)

q = p.communicate(input=instring)
s = q[0]
##print s

# Split the file into (A)Tree and (B)Relations
## Try to find an empty line.
t = s.split('\n\n', 2)
#print t[0]  # t[0] contains (A)Tree part
#print '---'
#print t[1]  # t[1] contains (B)Relation part

# Put the (A) Tree part into the "bracket_parse" method
tr = nltk.bracket_parse(t[0])    # end of the fixed part of (A)
#print tr    # We got the tree

# Try various Tree methods       # usage examples for (A) start here
#(1) pick up various nodes
#print tr[0]  # print the 1st node => " subtree "
#print tr[0].node  # print NODE (property) part of the 1st node => "S"
             # Note that the top level does not have the 2nd branch
#print tr[0,0]  # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]  # print the 2nd node under the 1st node -> "(, ,)
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()  # "subtrees" method creates a "generator"
for u in ss:
  print u    # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if u.node == "NP":     #   NP is Noun Phrase, i.e., not leaves
    print u
print "----"
#
# Now, interpret the (B) Relation part (second part t[1])   # end of (A); part (B) starts here
#
relation = []              # start with an empty list        # fixed part of (B)
mref = re.compile(r"((\w|[-])+)\(((\w|[-])+), ((\w|[-])+)\).*", re.S)  # re.S = DOTALL
for u in t[1].split('\n'):           # go through the relation part line by line
  m = mref.search(u)       # break abc_d(efg-h, ijk-l) into its three parts
  if m:
    relation.append([m.group(1), m.group(3), m.group(5)])  # end of the fixed part of (B)

print relation             # usage examples for (B) start here
print "----"
for v in relation:         # from here on, the contents of relation can be used freely
  print 'relation: ' + v[0] + '  head: ' + v[1] + '  tail: ' + v[2]
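
The relation-parsing step (B) can be exercised in isolation on a few dependency lines. The following is a sketch, not the program above: it writes the pattern with character classes so that the three parts are groups 1, 2 and 3 (the program above uses nested groups and therefore reads groups 1, 3 and 5), and the names DEP_LINE and parse_relations are made up for illustration. The sample lines are taken from the parser output for the example sentence.

```python
import re

# name(head-i, tail-j): three runs of word characters or hyphens
DEP_LINE = re.compile(r"([\w-]+)\(([\w-]+), ([\w-]+)\)")


def parse_relations(text):
    """Return [name, head, tail] for every line that looks like name(head, tail)."""
    relations = []
    for line in text.splitlines():
        m = DEP_LINE.search(line)
        if m:
            relations.append([m.group(1), m.group(2), m.group(3)])
    return relations


sample = """nn(cancer-2, Lung-1)
nsubj(common-6, cancer-2)
prep_in(common-6, women-8)
"""

for name, head, tail in parse_relations(sample):
    print('relation: ' + name + '  head: ' + head + '  tail: ' + tail)
```

As with the pattern used above, a head or tail that contains characters other than word characters and hyphens (for instance a token such as 0.25-2mM) does not match, so that relation is silently left out.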

The execution result is as follows (the first half is the same as for version −1):

%python readfromstanford.py
[Found java: /usr/java/default/bin/java]
Loading parser from serialized file /usr/local/stanford-parser-2008-10-26/englishPCFG.ser.gz ... done [3.0 sec].
(ROOT
  (S
    (S
      (NP (NNP Lung) (NN cancer))
      (VP
        (VBZ has)
        (VP
          (VBN become)
          (ADJP (RB increasingly) (JJ common))
          (PP (IN in) (NP (NNS women))))))
    (, ,)
    (CC and)
    (S
      (NP
        (NP (NN gender) (NNS differences))
        (PP
          (IN in)
          (NP
            (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
            (PP (IN of) (NP (DT the) (NN disease))))))
      (VP
        (VBP have)
        (VP
          (VBN suggested)
          (NP
            (NP (DT a) (NN role))
            (PP (IN for) (NP (NNS estrogens)))))))
    (. .)))
(S
  (S
    (NP (NNP Lung) (NN cancer))
    (VP
      (VBZ has)
      (VP
        (VBN become)
        (ADJP (RB increasingly) (JJ common))
        (PP (IN in) (NP (NNS women))))))
  (, ,)
  (CC and)
  (S
    (NP
      (NP (NN gender) (NNS differences))
      (PP
        (IN in)
        (NP
          (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
          (PP (IN of) (NP (DT the) (NN disease))))))
    (VP
      (VBP have)
      (VP
        (VBN suggested)
        (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
  (. .))
(S
  (NP (NNP Lung) (NN cancer))
  (VP
    (VBZ has)
    (VP
      (VBN become)
      (ADJP (RB increasingly) (JJ common))
      (PP (IN in) (NP (NNS women))))))
(NP (NNP Lung) (NN cancer))
(NNP Lung)
(NN cancer)
(VP
  (VBZ has)
  (VP
    (VBN become)
    (ADJP (RB increasingly) (JJ common))
    (PP (IN in) (NP (NNS women)))))
(VBZ has)
(VP
  (VBN become)
  (ADJP (RB increasingly) (JJ common))
  (PP (IN in) (NP (NNS women))))
(VBN become)
(ADJP (RB increasingly) (JJ common))
(RB increasingly)
(JJ common)
(PP (IN in) (NP (NNS women)))
(IN in)
(NP (NNS women))
(NNS women)
(, ,)
(CC and)
(S
  (NP
    (NP (NN gender) (NNS differences))
    (PP
      (IN in)
      (NP
        (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
        (PP (IN of) (NP (DT the) (NN disease))))))
  (VP
    (VBP have)
    (VP
      (VBN suggested)
      (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
(NP
  (NP (NN gender) (NNS differences))
  (PP
    (IN in)
    (NP
      (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
      (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NN gender)
(NNS differences)
(PP
  (IN in)
  (NP
    (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
    (PP (IN of) (NP (DT the) (NN disease)))))
(IN in)
(NP
  (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
  (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(DT the)
(NN physiology)
(CC and)
(NN pathogenesis)
(PP (IN of) (NP (DT the) (NN disease)))
(IN of)
(NP (DT the) (NN disease))
(DT the)
(NN disease)
(VP
  (VBP have)
  (VP
    (VBN suggested)
    (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))
(VBP have)
(VP
  (VBN suggested)
  (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))
(VBN suggested)
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(DT a)
(NN role)
(PP (IN for) (NP (NNS estrogens)))
(IN for)
(NP (NNS estrogens))
(NNS estrogens)
(. .)
----
(NN cancer)
(NNS women)
(NN gender)
(NNS differences)
(NN physiology)
(NN pathogenesis)
(NN disease)
(NN role)
(NNS estrogens)
----
cancer
women
gender
differences
physiology
pathogenesis
disease
role
estrogens
----
(NP (NNP Lung) (NN cancer))
(NP (NNS women))
(NP
  (NP (NN gender) (NNS differences))
  (PP
    (IN in)
    (NP
      (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
      (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NP
  (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
  (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(NP (DT the) (NN disease))
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(NP (NNS estrogens))
----
[['nn', 'cancer-2', 'Lung-1'], ['nsubj', 'common-6', 'cancer-2'], ['aux', 'common-6', 'has-3'], 
['cop', 'common-6', 'become-4'], ['advmod', 'common-6', 'increasingly-5'], ['prep_in', 'common-6', 'women-8'], ['nn', 'differences-12', 'gender-11'], 
['nsubj', 'suggested-22', 'differences-12'], ['det', 'physiology-15', 'the-14'], ['prep_in', 'differences-12', 'physiology-15'], 
['conj_and', 'physiology-15', 'pathogenesis-17'], ['det', 'disease-20', 'the-19'], ['prep_of', 'physiology-15', 'disease-20'], 
['aux', 'suggested-22', 'have-21'], ['conj_and', 'common-6', 'suggested-22'], ['det', 'role-24', 'a-23'], 
['dobj', 'suggested-22', 'role-24'], ['prep_for', 'role-24', 'estrogens-26']]
----
relation: nn  head: cancer-2  tail: Lung-1
relation: nsubj  head: common-6  tail: cancer-2
relation: aux  head: common-6  tail: has-3
relation: cop  head: common-6  tail: become-4
relation: advmod  head: common-6  tail: increasingly-5
relation: prep_in  head: common-6  tail: women-8
relation: nn  head: differences-12  tail: gender-11
relation: nsubj  head: suggested-22  tail: differences-12
relation: det  head: physiology-15  tail: the-14
relation: prep_in  head: differences-12  tail: physiology-15
relation: conj_and  head: physiology-15  tail: pathogenesis-17
relation: det  head: disease-20  tail: the-19
relation: prep_of  head: physiology-15  tail: disease-20
relation: aux  head: suggested-22  tail: have-21
relation: conj_and  head: common-6  tail: suggested-22
relation: det  head: role-24  tail: a-23
relation: dobj  head: suggested-22  tail: role-24
relation: prep_for  head: role-24  tail: estrogens-26
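
As one possible use of the relation list, subject/verb/object triples can be collected by pairing nsubj and dobj relations that share a head. This is only a sketch of the idea, not something the program above does; the function name subject_verb_object is hypothetical, and a head with several subjects or objects keeps only the last one seen.

```python
def subject_verb_object(relations):
    """Pair nsubj and dobj relations that share the same head (the verb)."""
    subjects = {}   # verb -> subject
    objects = {}    # verb -> direct object
    for name, head, tail in relations:
        if name == 'nsubj':
            subjects[head] = tail
        elif name == 'dobj':
            objects[head] = tail
    triples = []
    for verb in subjects:
        if verb in objects:
            triples.append((subjects[verb], verb, objects[verb]))
    return sorted(triples)


# A few entries copied from the relation list printed above
relations = [
    ['nsubj', 'common-6', 'cancer-2'],
    ['nsubj', 'suggested-22', 'differences-12'],
    ['dobj', 'suggested-22', 'role-24'],
]
print(subject_verb_object(relations))
```

For the example sentence this pairs differences-12 and role-24 through suggested-22; common-6 has a subject but no direct object, so it yields no triple.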

Sample: version −1 = version −2 plus a call to the Stanford parser (2009/12/25)

Prepare the Java program (class file) that calls the Stanford parser in advance. The source is:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class StanfordFromNltk{
  public static void main(String[] args) {
    LexicalizedParser lp = new LexicalizedParser("/usr/local/stanford-parser-2008-10-26/englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    String sent = "";
    try{
      BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
      // String sent = "This is an easy sentence.";
      sent = br.readLine();
      br.close();
    }
    catch(IOException e){
      System.out.println("Input Error");
    }
    Tree parse = (Tree) lp.apply(sent);
//    parse.pennPrint();
//    System.out.println();

//    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
//    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
//    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
//    Collection tdl = gs.typedDependenciesCollapsed();
//    System.out.println(tdl);
//    System.out.println();
//
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }
}

Compile this with javac to produce the file StanfordFromNltk.class.

Next, the Python program that calls it is as follows.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import nltk
from subprocess import PIPE          # apparently required (provides PIPE used below)

instr = "Lung cancer has become increasingly common in women, and gender differences \
in the physiology and pathogenesis of the disease have suggested a role for estrogens."

# Invoke the Java Program "StanfordFromNltk.class" from nltk
nltk.internals.config_java()
p = nltk.internals.java(['StanfordFromNltk'], '/home/yamanouc/src/stanford:/usr/local/stanford-parser/stanford-parser.jar',
stdin=PIPE, stdout=PIPE, blocking=False)
#             # Launch the Java program. stdin=PIPE: its input comes from a pipe. stdout=PIPE: its output goes to a pipe.
#             #          blocking=False allows the input pipe to be used; the process runs concurrently
q = p.communicate(input=instr) # run java: instr is sent through the pipe; the return value is the tuple (stdout, stderr)
s = q[0]      # element 0 of the tuple (stdout, stderr), i.e. stdout
print s
#
#============================================
#                                # from here on, the same as Sample version -2
# Split the file into (A)Tree and (B)Relations
## Try to find an empty line.
t = s.split('\n\n', 2)           # split the input picked up in s (= the Stanford output)
#print t[0]  # t[0] contains (A)Tree part
#print '---'
#print t[1]  # t[1] contains (B)Relation part

# Input the (A)Tree part to the "bracket_parse" method
tr = nltk.bracket_parse(t[0])
#print tr    # We got the tree

# Try various Tree methods
#(1) pick up various nodes
#print tr[0]  # print the 1st node => " subtree "
#print tr[0].node  # print NODE (property) part of the 1st node => "S"
             # Note that the top level does not have the 2nd branch
#print tr[0,0]  # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]  # print the 2nd node under the 1st node -> "(, ,)
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()  # "subtrees" method creates a "generator"
for u in ss:
  print u    # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if u.node == "NP":     #   NP is Noun Phrase, i.e., not leaves
    print u
print "----"
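
The pipe handling in the program above (stdin=PIPE, stdout=PIPE, then communicate()) is the same pattern as with subprocess.Popen. The sketch below shows it without Java or NLTK: as a stand-in for StanfordFromNltk, the child process is a one-line Python script that, like the Java program, reads a single line from standard input and writes its result to standard output. CHILD_CODE and run_child are names made up for this illustration.

```python
import subprocess
import sys

# Stand-in child program (assumption): read ONE line, write it back in upper case
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.readline().upper())"


def run_child(text):
    """Send text to the child's stdin and return what it wrote to stdout."""
    p = subprocess.Popen(
        [sys.executable, '-c', CHILD_CODE],
        stdin=subprocess.PIPE,     # we write the child's input through a pipe
        stdout=subprocess.PIPE,    # and read its output through another pipe
        universal_newlines=True,   # exchange text (str) instead of bytes
    )
    # communicate() writes the input, closes stdin, and waits for the child;
    # it returns the tuple (stdout, stderr). stderr is None here (not piped).
    out, err = p.communicate(input=text)
    return out


print(run_child('lung cancer\nthis second line is never read\n'))
```

Because the child reads only one line, everything after the first line break is ignored. The same applies to the Java program, which calls readLine() once, so the text handed to it should not contain line breaks.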

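Sample: version −2 (2009/12/25)

This version still skips both calling the Stanford parser and reading the relation part. Instead of calling the parser, it reads a file in which the parser's output has been saved.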

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import nltk

f = open('ParserDemoMore.out')    # this file holds the parser's output
s = f.read()

# Split the file into (A)Tree and (B)Relations
## Try to find an empty line.
t = s.split('\n\n', 2)
#print t[0]  # t[0] contains (A)Tree part
#print '---'
#print t[1]  # t[1] contains (B)Relation part

# Input the (A)Tree part to the "bracket_parse" method; this loads the tree part into Python
tr = nltk.bracket_parse(t[0])
#print tr    # We got the tree

# Try various Tree methods
#(1) pick up various nodes
#print tr[0]  # print the 1st node => " subtree "
#print tr[0].node  # print NODE (property) part of the 1st node => "S"
             # Note that the top level does not have the 2nd branch
#print tr[0,0]  # print the 1st node under the 1st node
#print tr[0,0,0]
#print tr[0,0,0,0]
#print "----"
#
#print tr[0,1]  # print the 2nd node under the 1st node -> "(, ,)
#print "----"
#
#(2) Pick up all subtrees in the whole tree
ss = tr.subtrees()  # "subtrees" method creates a "generator"
for u in ss:
  print u    # print all subtrees
print "----"
#
#(2-2) From the subtrees, select Nouns
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u
print "----"
#(2-2-2) If you want only the word part, use "[0]" to extract it.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if (u.node == "NN") or (u.node == "NNS"):  #  Recall NN/NNS are used only for leaves
    print u[0]
print "----"
#(2-2-3) You can always pick up NP (Noun Phrase) if necessary.
for u in tr.subtrees():  # Need to invoke "subtrees" every time because it's a generator
  if u.node == "NP":     #   NP is Noun Phrase, i.e., not leaves
    print u
print "----"
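
What the program above does with the parser output (split at the empty line, read the bracketed tree, walk its subtrees) can be sketched in plain Python. This is an illustration only and not how NLTK implements it: parse_brackets, subtrees and nouns are hypothetical helpers, a node is represented as a (label, children) tuple, and the sample is one S subtree from the output of the example sentence followed by two of its relation lines.

```python
import re

PARSER_OUTPUT = """(S
  (NP (NNP Lung) (NN cancer))
  (VP
    (VBZ has)
    (VP
      (VBN become)
      (ADJP (RB increasingly) (JJ common))
      (PP (IN in) (NP (NNS women))))))

nn(cancer-2, Lung-1)
nsubj(common-6, cancer-2)
"""


def parse_brackets(text):
    """Turn a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [('TOP', [])]              # sentinel that receives the finished tree
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '(':
            stack.append((tokens[i + 1], []))   # token after '(' is the label
            i += 2
        elif tok == ')':
            node = stack.pop()
            stack[-1][1].append(node)           # attach finished node to parent
            i += 1
        else:
            stack[-1][1].append(tok)            # a leaf (the word itself)
            i += 1
    return stack[0][1][0]


def subtrees(node):
    """Yield the node itself and every subtree below it, top-down."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            for sub in subtrees(child):
                yield sub


def nouns(tree):
    """Collect the words under NN and NNS nodes."""
    return [n[1][0] for n in subtrees(tree) if n[0] in ('NN', 'NNS')]


parts = PARSER_OUTPUT.split('\n\n')    # (A) tree part, (B) relation part
tree_part, relation_part = parts[0], parts[1]
tree = parse_brackets(tree_part)
print(tree[0])
print(nouns(tree))
```

Like the subtrees() call in the program above, the generator here has to be created anew for every pass over the tree.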

The output of this program is:

(ROOT
  (S
    (S
      (NP (NNP Lung) (NN cancer))
      (VP
        (VBZ has)
        (VP
          (VBN become)
          (ADJP (RB increasingly) (JJ common))
          (PP (IN in) (NP (NNS women))))))
    (, ,)
    (CC and)
    (S
      (NP
        (NP (NN gender) (NNS differences))
        (PP
          (IN in)
          (NP
            (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
            (PP (IN of) (NP (DT the) (NN disease))))))
      (VP
        (VBP have)
        (VP
          (VBN suggested)
          (NP
            (NP (DT a) (NN role))
            (PP (IN for) (NP (NNS estrogens)))))))
    (. .)))
(S
  (S
    (NP (NNP Lung) (NN cancer))
    (VP
      (VBZ has)
      (VP
        (VBN become)
        (ADJP (RB increasingly) (JJ common))
        (PP (IN in) (NP (NNS women))))))
  (, ,)
  (CC and)
  (S
    (NP
      (NP (NN gender) (NNS differences))
      (PP
        (IN in)
        (NP
          (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
          (PP (IN of) (NP (DT the) (NN disease))))))
    (VP
      (VBP have)
      (VP
        (VBN suggested)
        (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
  (. .))
(S
  (NP (NNP Lung) (NN cancer))
  (VP
    (VBZ has)
    (VP
      (VBN become)
      (ADJP (RB increasingly) (JJ common))
      (PP (IN in) (NP (NNS women))))))
(NP (NNP Lung) (NN cancer))
(NNP Lung)
(NN cancer)
(VP
  (VBZ has)
  (VP
    (VBN become)
    (ADJP (RB increasingly) (JJ common))
    (PP (IN in) (NP (NNS women)))))
(VBZ has)
(VP
  (VBN become)
  (ADJP (RB increasingly) (JJ common))
  (PP (IN in) (NP (NNS women))))
(VBN become)
(ADJP (RB increasingly) (JJ common))
(RB increasingly)
(JJ common)
(PP (IN in) (NP (NNS women)))
(IN in)
(NP (NNS women))
(NNS women)
(, ,)
(CC and)
(S
  (NP
    (NP (NN gender) (NNS differences))
    (PP
      (IN in)
      (NP
        (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
        (PP (IN of) (NP (DT the) (NN disease))))))
  (VP
    (VBP have)
    (VP
      (VBN suggested)
      (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))))
(NP
  (NP (NN gender) (NNS differences))
  (PP
    (IN in)
    (NP
      (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
      (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NN gender)
(NNS differences)
(PP
  (IN in)
  (NP
    (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
    (PP (IN of) (NP (DT the) (NN disease)))))
(IN in)
(NP
  (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
  (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(DT the)
(NN physiology)
(CC and)
(NN pathogenesis)
(PP (IN of) (NP (DT the) (NN disease)))
(IN of)
(NP (DT the) (NN disease))
(DT the)
(NN disease)
(VP
  (VBP have)
  (VP
    (VBN suggested)
    (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))))
(VBP have)
(VP
  (VBN suggested)
  (NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens)))))
(VBN suggested)
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(DT a)
(NN role)
(PP (IN for) (NP (NNS estrogens)))
(IN for)
(NP (NNS estrogens))
(NNS estrogens)
(. .)
----
(NN cancer)
(NNS women)
(NN gender)
(NNS differences)
(NN physiology)
(NN pathogenesis)
(NN disease)
(NN role)
(NNS estrogens)
----
cancer
women
gender
differences
physiology
pathogenesis
disease
role
estrogens
----
(NP (NNP Lung) (NN cancer))
(NP (NNS women))
(NP
  (NP (NN gender) (NNS differences))
  (PP
    (IN in)
    (NP
      (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
      (PP (IN of) (NP (DT the) (NN disease))))))
(NP (NN gender) (NNS differences))
(NP
  (NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
  (PP (IN of) (NP (DT the) (NN disease))))
(NP (DT the) (NN physiology) (CC and) (NN pathogenesis))
(NP (DT the) (NN disease))
(NP (NP (DT a) (NN role)) (PP (IN for) (NP (NNS estrogens))))
(NP (DT a) (NN role))
(NP (NNS estrogens))
----
