Parsing HTML with Perl's HTML::TagParser

Continuing from the previous article, this post is about Perl again. This time I parsed HTML fetched from a URL.

www.konosumi.net

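Preparing the Perl development environment

First, set up the development environment. The "-it" option together with "/bin/bash" drops you into a shell inside the Docker container that was just started.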

docker run -v $(pwd):/work -it --rm perl:5.30 /bin/bash

Next, install the CPAN modules.

cpanm JSON LWP::UserAgent
cpanm LWP::Protocol::https
cpanm HTML::TagParser

Because /work is mounted, changing into that directory completes the setup.

# "-v $(pwd):/work" mounts the host's current directory at /work inside the container
cd /work/
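
Before moving on, it can be worth confirming that the modules installed above actually load inside the container. The following is a minimal sketch rather than part of the original setup; the helper name module_available is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# (module_available is a hypothetical helper, not part of any library.)
sub module_available {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report the status of each module installed with cpanm.
for my $module (qw(JSON LWP::UserAgent LWP::Protocol::https HTML::TagParser)) {
    printf "%-22s %s\n", $module, module_available($module) ? 'ok' : 'MISSING';
}
```

Any module reported as MISSING can be installed again with cpanm before running the scripts below.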

Overview of HTML::TagParser

As the name suggests, HTML::TagParser is an HTML parser. The full list of methods is documented on metacpan, and reading it gives a good picture of what the module can do.

metacpan.org

Extracting the blog name and article titles from my own blog

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TagParser;
use Data::Dumper;
use LWP::UserAgent;
use URI;
use HTML::Entities qw/decode_entities/;

sub getHtmlParser {
    my ($url) = @_;

    my $ua = LWP::UserAgent->new;
    my $uri = URI->new($url);
    my $res = $ua->get($uri);
    return HTML::TagParser->new( $res->content );
}

my $html = getHtmlParser('https://www.konosumi.net/');

# Get the blog title
warn decode_entities($html->getElementById('blog-title')->innerText());
# このすみエンジニアブログ at sample.pl line 21.

# Extract and print the article titles from the blog's article list
my @elements = $html->getElementsByClassName('entry-title');
foreach my $elem (@elements) {
    warn decode_entities($elem->innerText());
    # Perlを復習しながら、HTTP(S)クライアントとJSONパースからのCSVファイル出力を書いた at sample.pl line 25.
    # テレワーク・リモートワークのための、環境整備のガイドライン at sample.pl line 25.
    # ...
}
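
The script above assumes that the HTTP request succeeds and that the id it looks up exists. If getElementById finds nothing it returns undef in scalar context, so chaining innerText() onto it would die. The following is a minimal sketch of a more defensive variant. The helper names fetch_html and text_by_id and the inline HTML are made up for illustration, and the demonstration runs offline on a string instead of hitting the blog.

```perl
use strict;
use warnings;
use HTML::TagParser;
use LWP::UserAgent;

# Fetch a page and die with the HTTP status line if the request fails.
# (fetch_html is a hypothetical helper; it is defined here but not called.)
sub fetch_html {
    my ($url) = @_;
    my $res = LWP::UserAgent->new( timeout => 10 )->get($url);
    die "GET $url failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->content;
}

# Return the text of the element with the given id, or undef if it is absent.
sub text_by_id {
    my ( $html, $id ) = @_;
    my $elem = $html->getElementById($id);
    return defined $elem ? $elem->innerText() : undef;
}

# Offline demonstration on made-up markup.
my $html = HTML::TagParser->new(
    '<html><body><h1 id="blog-title">Sample Blog</h1></body></html>'
);

my $found   = text_by_id( $html, 'blog-title' );
my $missing = text_by_id( $html, 'no-such-id' );

print defined $found   ? $found   : '(not found)', "\n";
print defined $missing ? $missing : '(not found)', "\n";
```

With a live page, HTML::TagParser->new( fetch_html($url) ) would take the place of the inline string.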

How to use HTML::TagParser

Creating an instance and parsing

The easiest approach is to pass the constructor what you want to parse: a URL, a file name, or a string of HTML.

# If new() is called with a URL, this method fetches a HTML file from remote web server and parses it and returns its instance. URI::Fetch module is required to fetch a file.
$html = HTML::TagParser->new( $url );

# If new() is called with a filename, this method parses a local HTML file and returns its instance
$html = HTML::TagParser->new( $file );

# If new() is called with a string of HTML source code, this method parses it and returns its instance.
$html = HTML::TagParser->new( "<html>...snip...</html>" );

Alternatively, create the instance first and parse afterwards.

my $html = HTML::TagParser->new;

# This method parses a local HTML file.
$html->open( $file );

# This method parses a string of HTML source code.
$html->parse( $source );
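
To make the two-step form concrete, here is a small sketch that parses the same made-up HTML once with parse() and once with open(). The temporary file created with File::Temp stands in for a real local HTML file, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TagParser;

my $source = '<html><head><title>Sample page</title></head>'
           . '<body><p>Hello</p></body></html>';

# parse(): feed an HTML string to an existing instance.
my $from_string = HTML::TagParser->new;
$from_string->parse($source);
my $title_elem_s      = $from_string->getElementsByTagName('title');
my $title_from_string = $title_elem_s->innerText();

# open(): read and parse a local file (a temporary file in this sketch).
my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} $source;
close $fh or die "close failed: $!";

my $from_file = HTML::TagParser->new;
$from_file->open($filename);
my $title_elem_f    = $from_file->getElementsByTagName('title');
my $title_from_file = $title_elem_f->innerText();

print "from string: $title_from_string\n";
print "from file:   $title_from_file\n";
```

Both instances end up holding the same parsed document, so the choice between the two forms mostly depends on where the HTML comes from.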

Getting elements (DOM)

Elements are retrieved with the getElementBy... family of methods, whose names resemble their JavaScript counterparts.

# ID
# This method returns the element whose id attribute is $id.
$elem = $html->getElementById( $id );

# name attribute
# This method returns an array of elements whose name attribute is $name. In scalar context, only the first element is returned.
@elem = $html->getElementsByName( $name );

# Tag name
# This method returns an array of elements whose tagName is $tagname. In scalar context, only the first element is returned.
@elem = $html->getElementsByTagName( $tagname );

# Class name
# This method returns an array of elements whose className is $class. In scalar context, only the first element is returned.
@elem = $html->getElementsByClassName( $class );

# Arbitrary attribute
# This method returns an array of elements whose $attrname attribute has the value $value. In scalar context, only the first element is returned.
@elem = $html->getElementsByAttribute( $attrname, $value );
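
The difference between list and scalar context is easy to miss, so here is a short sketch that exercises the getters on a made-up HTML fragment. The markup and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TagParser;

my $html = HTML::TagParser->new(<<'HTML');
<html><body>
<h1 id="blog-title">Sample Blog</h1>
<a class="entry-title" href="/entry/1">First post</a>
<a class="entry-title" href="/entry/2">Second post</a>
<input name="q" type="text" value="perl">
</body></html>
HTML

# List context: every matching element is returned.
my @links  = $html->getElementsByTagName('a');
my @titles = map { $_->innerText() } $html->getElementsByClassName('entry-title');

# Scalar context: only the first match is returned.
my $first_link = $html->getElementsByTagName('a');
my $first_href = $first_link->getAttribute('href');

# Lookup by name attribute and by an arbitrary attribute/value pair.
my $query_field = $html->getElementsByName('q');
my $text_input  = $html->getElementsByAttribute( 'type', 'text' );

print scalar(@links), " links\n";
print "$_\n" for @titles;
print "first href: $first_href\n";
print "input value: ", $query_field->getAttribute('value'), "\n";
print "text input tag: ", $text_input->tagName(), "\n";
```

Assigning to an array collects every match, while assigning to a scalar is a convenient shortcut when only the first match matters.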

The getElementBy... methods return HTML::TagParser::Element objects. HTML::TagParser::Element provides methods such as $elem->firstChild(), $elem->childNodes(), and $elem->parentNode(), so you can also walk the tree starting from any element you have retrieved.
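
As a rough sketch of that kind of traversal, the following walks down and back up from a single element. It assumes an HTML::TagParser version recent enough to provide these traversal methods, and that childNodes() returns an array reference as described in the module's documentation. The markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TagParser;

my $html = HTML::TagParser->new(
    '<html><body><div id="box"><span>One</span><span>Two</span></div></body></html>'
);

my $box = $html->getElementById('box');

# childNodes() gives the child elements as an array reference.
my $children    = $box->childNodes();
my @child_texts = map { $_->innerText() } @$children;

# firstChild() is the first child element; parentNode() walks back up.
my $first  = $box->firstChild();
my $parent = $first->parentNode();

print "children: @child_texts\n";
print "first child tag: ", $first->tagName(), "\n";
print "parent tag: ", $parent->tagName(), "\n";
```

Only tags are represented as elements here, so the traversal moves between elements and the text is read through innerText().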