HTML::ExtractText - extract multiple text strings from HTML content, using CSS selectors
At its simplest, use CSS selectors:
We can go fancy pants with selectors as well as extract more than one bit of text:
We can also pass in an object and let the extractor call setter methods on it when it extracts text:
The module extracts one or more text strings from HTML documents, using CSS selectors to declare what text needs extracting. It can either return the results as a hashref or automatically call setter methods on an object you provide.
If you're looking for extra automatic post-processing and a laxer definition of what constitutes "text", see HTML::ExtractText::Extra.
$extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
bless {}, 'Foo',
) or die "Extraction error: $extractor";
print "Title is: $extractor->{stuff}\n\n";
The module incorporates two overloaded methods: ->error(), which is overloaded for interpolation (use overload q|""| ...), and ->last_results(), which is overloaded for hash dereferencing (use overload q|%{}| ...).
What this means is that you can interpolate the object in a string to retrieve the error message, and you can use the object as a hashref to access the hashref returned by ->last_results().
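To make the overloading mechanism concrete, here is a minimal sketch using a toy class. My::Toy::Extractor, its stored error message, and its stored results are all hypothetical and are not part of HTML::ExtractText; the sketch only shows how the two overload keys named above behave in plain Perl.

```perl
package My::Toy::Extractor;   # hypothetical class, for illustration only
use strict;
use warnings;

# The object is a blessed arrayref, so hash dereferencing can be
# overloaded without touching the object's own storage.
use overload
    q|""|  => sub { $_[0]->error },          # used when interpolated
    q|%{}| => sub { $_[0]->last_results },   # used when treated as a hashref
    fallback => 1;

sub new {
    my $class = shift;
    # Slot 0 holds an error message, slot 1 holds a results hashref
    return bless [ 'Something went wrong', { stuff => 'My html code!' } ], $class;
}
sub error        { return $_[0][0] }
sub last_results { return $_[0][1] }

package main;

my $toy = My::Toy::Extractor->new;

my $as_string = "Error: $toy";            # stringification calls ->error()
my $as_hash   = "Title: $toy->{stuff}";   # hash dereference calls ->last_results()

print "$as_string\n$as_hash\n";
```

The toy object is an arrayref rather than a hashref, presumably the simplest way to keep the overloaded hash dereference from clashing with the object's internal data.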
->new()
my $extractor = HTML::ExtractText->new;
my $extractor = HTML::ExtractText->new(
separator => "\n",
ignore_not_found => 1,
); # default values for arguments are shown
Creates and returns a new HTML::ExtractText object. Takes optional arguments as key/value pairs:
separator
my $extractor = HTML::ExtractText->new(
separator => "\n", # default value
);
my $extractor = HTML::ExtractText->new(
separator => undef,
);
Optional. Default: \n (new line). Takes undef or a string as a value. Specifies what to do when a CSS selector matches multiple elements. If set to a string value, text from all the matching elements will be joined using that string. If set to undef, no joining will happen and results will be returned as arrayrefs instead of strings (even if a selector matches a single element).
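The two behaviours can be sketched in plain Perl. The combine_matches helper below is hypothetical and only mirrors the description above; it is not the module's own code, and the sample strings stand in for text taken from matched elements.

```perl
use strict;
use warnings;

# Hypothetical helper: combine the text of all matched elements the way
# the separator option is described.
sub combine_matches {
    my ( $separator, @texts ) = @_;
    return defined $separator
        ? join( $separator, @texts )   # string separator: one joined string
        : [ @texts ];                  # undef separator: always an arrayref
}

my @texts = ( 'First', 'Second' );     # stand-in for two matched elements

my $joined = combine_matches( "\n",  @texts );   # "First\nSecond"
my $list   = combine_matches( undef, @texts );   # [ 'First', 'Second' ]
my $single = combine_matches( undef, 'Only'  );  # [ 'Only' ], still an arrayref

print "$joined\n";
print scalar(@$list), " items in list; single match is an ",
    ref($single), " reference\n";
```

With separator set to undef, calling code can always treat a value as an arrayref and never has to check whether a selector matched one element or several.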
ignore_not_found
my $extractor = HTML::ExtractText->new(
ignore_not_found => 1, # default value
);
my $extractor = HTML::ExtractText->new(
ignore_not_found => 0,
);
Optional. Default: 1 (true). Takes a true or false value. Specifies whether to consider it an error when any of the given selectors matches nothing. If set to a true value, any non-matching selectors will have empty strings as values and no errors will be reported. If set to a false value, all selectors must match at least one element or the module will error out.
->extract()
my $results = $extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
$some_object, # optional
) or die "Extraction error: $extractor";
print "Title is: $extractor->{stuff}\n\n";
# $extractor->{stuff} is the same as $results->{stuff}
Takes two mandatory arguments and one optional argument. Extracts text from the given HTML code and returns a hashref with the results (see the ->last_results() method). On error, returns undef or an empty list, and the error will be available via the ->error() method. Even if errors occurred, anything that was successfully extracted will still be available through the ->last_results() method.
$extractor->extract(
{ stuff => 'title', },
... ,
... ,
) or die "Extraction error: $extractor";
The first argument must be a hashref. The keys can be whatever you want; you will use them to refer to the extracted text. The values must be CSS selectors that match the elements you want to extract text from. All the selectors listed on https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS are supported.
Note: the values will be modified in place in the original hashref you provided, so you can use that to your advantage, if needed.
$extractor->extract(
... ,
'<title>My html code!</title>',
... ,
) or die "Extraction error: $extractor";
The second argument is a string containing the HTML code you're trying to extract text from.
$extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
$some_object,
) or die "Extraction error: $extractor";
# this is what is being done automatically, during extraction,
# for each key in the first argument of ->extract():
# $some_object->stuff( $extractor->{stuff} );
The third argument is optional and has no default. For convenience, you can supply an object and HTML::ExtractText will call methods on it. The called methods will be named after the keys of the first argument given to ->extract(), and the extracted text will be passed to those methods as the first argument.
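As a rough illustration of what that means for the object you pass in, the sketch below defines a hypothetical My::Page class with one setter per key and replays the per-key method calls by hand. The %extracted hash is a stand-in for extraction results, so the example runs without the module.

```perl
package My::Page;   # hypothetical class: one setter/getter per extraction key
use strict;
use warnings;

sub new   { return bless {}, shift }
sub title { my $self = shift; $self->{title} = shift if @_; return $self->{title} }
sub intro { my $self = shift; $self->{intro} = shift if @_; return $self->{intro} }

package main;

# Stand-in for extracted text, keyed the same way as the first
# argument given to ->extract()
my %extracted = (
    title => 'My html code!',
    intro => 'Hello',
);

my $page = My::Page->new;

# One method call per key, with the extracted text as the first argument
$page->$_( $extracted{$_} ) for keys %extracted;

print 'Title: ', $page->title, "\n";
print 'Intro: ', $page->intro, "\n";
```

Every key in the first argument therefore needs a same-named method on the object; a key without one would make the method call die.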
->error()
$extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
) or die "Extraction error: " . $extractor->error;
$extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
) or die "Extraction error: $extractor";
Takes no arguments. Returns the error message as a string, if any error occurred during the last call to ->extract(). Note that ->error() will only return one of the error messages, even if more than one selector failed. Examine the hashref returned by ->last_results() to find all the errors; for any selector that errored out, the value will begin with "ERROR: ", followed by the error message.
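Collecting every failure could look something like the sketch below. The $results hashref is a hand-written stand-in for what ->last_results() returns after a failed ->extract() call; its keys and the message text are invented for illustration, and only the "ERROR: " prefix comes from the description above.

```perl
use strict;
use warnings;

# Stand-in for $extractor->last_results; keys and message are made up
my $results = {
    title  => 'My html code!',
    author => 'ERROR: something went wrong with this selector',
};

# Failed keys hold a plain string starting with "ERROR: ". Arrayref values
# (separator set to undef) are skipped by the ref() check.
my @failed = grep {
    !ref( $results->{$_} ) && $results->{$_} =~ /\AERROR: /
} sort keys %$results;

for my $key (@failed) {
    ( my $message = $results->{$key} ) =~ s/\AERROR: //;
    print "Selector for '$key' failed: $message\n";
}
```

Extracted text that itself begins with "ERROR: " would be indistinguishable from a failure in this sketch, so treat the prefix check as a heuristic.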
->last_results()
$extractor->extract(
{ stuff => 'title', },
'<title>My html code!</title>',
) or die "Extraction error: $extractor";
print "Stuff is " . $extractor->last_results->{stuff} . "\n";
# or
print "Stuff is $extractor->{stuff}\n";
Takes no arguments. Returns the hashref produced by the last call to ->extract(). If ->extract() failed, you can still use ->last_results() to get anything that didn't error out (the error messages will be in the values of failed keys).
The hashref will contain the same keys as the first argument to ->extract() had, and the values will be replaced with whatever the selectors matched. If separator (see ->new()) is set to undef, the values will be arrayrefs, with each item in those arrayrefs corresponding to one matched element in the HTML.
The module will attempt to DWIM (Do What I Mean) when a selector matches form controls or images, and use the value="" or alt="" attributes as text sources.
->separator()
$extractor->separator("\n");
$extractor->separator(undef);
Accessor for the separator option (see ->new()). Takes one optional argument, which, if provided, will become the new separator.
->ignore_not_found()
$extractor->ignore_not_found(1);
$extractor->ignore_not_found(0);
Accessor for the ignore_not_found option (see ->new()). Takes one optional argument, which, if provided, will become the new value of the ignore_not_found option.
sub _extract {
my ( $self, $dom, $selector, $what ) = @_;
return $dom->find( $what->{ $selector } )
->map( sub { $self->_process( @_ ) } )->each;
}
You can subclass this module by overriding either or both of the _extract and _process methods. Their names and purpose are guaranteed to remain unchanged. See the source code for their default implementation.
This module does not automatically encode extracted text, so the examples in this documentation should really include something akin to:
use Encode;
my $title = encode 'utf8', $ext->{page_title};
print "$title\n";
HTML::ExtractText::Extra - a subclass that offers extra features
Mojo::DOM, Text::Balanced, HTML::Extract
You can use and distribute this module under the same terms as Perl itself.
See the LICENSE file included in this distribution for complete details.