Trying out Beautiful Soup with Python

Beautiful Soup looked like a good fit for parsing HTML in Python,
so I decided to try it out.

The first thing you need is:

BeautifulSoup

First, install it (really, you just put BeautifulSoup.py somewhere Python can import it from).

Download it from here.
This time I put it under "site-packages",
following the approach described on 清水川Web.
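
One way to confirm that the file ended up somewhere Python can import it from is to ask the interpreter directly. The sketch below is written for Python 3 (an assumption; the session in this post is Python 2) and uses only the standard library. The helper name `describe_module` is made up for illustration.

```python
import importlib.util
import sysconfig


def describe_module(name):
    """Return the file a top-level module would be imported from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Directory where third-party pure-Python modules are normally installed.
site_packages = sysconfig.get_paths()["purelib"]
print("site-packages:", site_packages)

# "BeautifulSoup" is the module name used in this post (BeautifulSoup.py).
location = describe_module("BeautifulSoup")
if location is None:
    print("BeautifulSoup is not importable from this interpreter")
else:
    print("BeautifulSoup would be loaded from", location)
```

If the module is reported as not importable, the file is probably sitting in a directory that is not on that interpreter's search path.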

Let's try it.

>>> import urllib2
>>> opener = urllib2.build_opener()
>>> html = opener.open('https://kishi-r.com/2008/02/ubuntu_1.html').read()
>>> print html

The contents of the HTML were printed.
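
For readers on Python 3, where `urllib2` was merged into `urllib.request`, the same fetch step might look roughly like the sketch below. The function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions added for illustration and are not part of the original session.

```python
import urllib.request


def fetch_html(url, default_encoding="utf-8"):
    """Download a page and return its body decoded to text."""
    opener = urllib.request.build_opener()
    with opener.open(url, timeout=10) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header when there is one.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors="replace")


# Example call (requires network access):
# html = fetch_html('https://kishi-r.com/2008/02/ubuntu_1.html')
```

Unlike the Python 2 session, this returns decoded text rather than raw bytes, so the encoding handling is explicit.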
Now let's use BeautifulSoup to pull out just the "title".

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.find('title')
<title>kishi-r.com: Ubuntuをインストールしました。</title>

Got it.

Let's search for "td" as well.

>>> soup.findAll('td')
[<td>
</td>, <td>
</td>, <td>
</td>, <td>
</td>, <td>
</td>, <td><span class="today">
<a href="https://kishi-r.com/2008/02/01/">1</a>
</span></td>, <td><span class="calendar">
2</span></td>, <td><span class="calendar">
3</span></td>, <td><span class="calendar">
4</span></td>, <td><span class="calendar">
5</span></td>, <td><span class="calendar">
6</span></td>, <td><span class="calendar">
7</span></td>, <td><span class="calendar">
8</span></td>, <td><span class="calendar">
9</span></td>, <td><span class="calendar">
10</span></td>, <td><span class="calendar">
11</span></td>, <td><span class="calendar">
12</span></td>, <td><span class="calendar">
13</span></td>, <td><span class="calendar">
14</span></td>, <td><span class="calendar">
15</span></td>, <td><span class="calendar">
16</span></td>, <td><span class="calendar">
17</span></td>, <td><span class="calendar">
18</span></td>, <td><span class="calendar">
19</span></td>, <td><span class="calendar">
20</span></td>, <td><span class="calendar">
21</span></td>, <td><span class="calendar">
22</span></td>, <td><span class="calendar">
23</span></td>, <td><span class="calendar">
24</span></td>, <td><span class="calendar">
25</span></td>, <td><span class="calendar">
26</span></td>, <td><span class="calendar">
27</span></td>, <td><span class="calendar">
28</span></td>, <td><span class="calendar">
29</span></td>, <td>
</td>]

And that worked too.
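
To make it clearer what "find every element with this tag name" involves, here is a rough standard-library stand-in written for Python 3. It is not how Beautiful Soup is implemented, only a sketch of the idea. The class `TagTextCollector`, the function `find_all_text`, and the sample HTML (including the example.com URL) are all invented for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every element with a given (lowercase) tag name."""

    def __init__(self, tag_name):
        super().__init__(convert_charrefs=True)
        self.tag_name = tag_name
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []    # one stripped text string per matching element
        self._chunks = []    # text pieces of the element currently being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            if self.depth == 0:
                self._chunks = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)


def find_all_text(html, tag_name):
    """Return the text of every <tag_name> element, in document order."""
    parser = TagTextCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented sample that imitates the calendar cells shown above.
sample = """<html><head><title>Sample page</title></head>
<body><table><tr>
<td>
</td><td><span class="today">
<a href="https://example.com/2008/02/01/">1</a>
</span></td><td><span class="calendar">
2</span></td>
</tr></table></body></html>"""

print(find_all_text(sample, "title"))   # ['Sample page']
print(find_all_text(sample, "td"))      # ['', '1', '2']

# Drop the empty padding cells, keeping only cells that contain text.
print([text for text in find_all_text(sample, "td") if text])   # ['1', '2']
```

The empty strings correspond to the blank `<td>` cells that pad the calendar, which is why the raw `findAll` output above starts with several empty entries.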

One interesting article I came across that uses BeautifulSoup
had a sample that searches Google News for a given word and returns the number of hits.
I think I'll try that next.

Posted by: kishir


6 thoughts on "Trying out Beautiful Soup with Python"

nobu says:

February 4, 2008, 2:16 PM

Wouldn't lxml do the job?

kishi-r says:

February 4, 2008, 2:34 PM

Yeah, fair point, lol.
Well, it was there,
so I just wanted to give it a quick try.

JunJun says:

November 10, 2011, 11:38 PM

watanabe-san,

I put BeautifulSoup-3.2.0 under site-packages, but your program does not work at all. What should I do?

Thank you in advance.

kishir says:

November 18, 2011, 1:31 AM

@JunJun

Could it be that you placed the whole "BeautifulSoup-3.2.0" directory there as it is?
If so, copying "BeautifulSoup.py" from inside "BeautifulSoup-3.2.0" to directly under "site-packages" will probably fix it.

If that still doesn't solve it, it would help if you could tell me, as far as you know, your

・environment
・error message

and so on.
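For reference, I tried it in the following environments:

・Mac OS X Lion, Python 2.5.5
・Mac OS X Lion, Python 2.6.5
・Ubuntu 10.04, Python 2.6.5

and it worked without problems in all of them.
Python on the Mac was installed with MacPorts, so I'm including part of the tree output below.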

/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages$ tree
|-- BeautifulSoup-3.2.0
|   |-- BeautifulSoup.py
|   |-- BeautifulSoupTests.py
|   |-- PKG-INFO
|   `-- setup.py
|-- BeautifulSoup-3.2.0.tar.gz
|-- BeautifulSoup.py

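By the way, on Ubuntu and the like, if Python was installed with aptitude or similar, placing the file directly under

・/usr/lib/python2.6/site-packages

should be fine.
# If you built Python from source, adjust the path accordingly.
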
sala says:

December 11, 2013, 11:57 AM

>>> soup.findAll('',)
>>>
Why does the above print nothing? What am I doing wrong?
I would really appreciate it if you could tell me.

kishir says:

March 11, 2014, 2:47 PM

@sala

I think it's because no tag name is specified.

>>> soup.findAll('td')

or

>>> soup.findAll('div')

Please give something like those a try.
