Elixir tips'n'tricks: Floki find elements by text using a regular expression
Floki is a great Elixir library used to parse HTML and query the result using CSS selectors.
CSS selectors have one big limitation: you can’t query by a node’s text content, and the standard still has no support for it.
To overcome this limitation, Floki added support for a contains selector, best known as the jQuery way of searching the DOM by text.
The syntax for the Floki filter is
Floki.find(html, "p:fl-contains('text')")
The main problem is that it only does a strict, case-sensitive match between the node text and the filter. In the example above it will only match p nodes whose text contains the exact string ‘text’.
You can’t match on begins-with or ends-with, and you can’t even do a case-insensitive search.
After studying how Floki implements pseudo-classes, I realized I could use regular expressions as filters, which gives much more flexibility.
As you can see in the implementation below, the text content is matched using the =~ operator, which means you can pass a regex on the right side.
# contains match implementation
%Text{content: content} -> content =~ value
value is a field inside the Floki.Selector.PseudoClass struct, which in turn is held inside the Floki.Selector struct. Floki.Selector is what gets built from the CSS selector string, and it is also what you can pass to Floki.find directly.
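To make the point about =~ concrete, here is a minimal sketch in plain Elixir (no Floki involved) showing how the operator behaves with a string versus a regex on the right side. The variable names are only illustrative; the sample strings are borrowed from the example at the end of this post.

```elixir
# Plain Elixir, no Floki needed: how =~ treats its right-hand side.
content = "./---DeFfff/"

# A string on the right is a case-sensitive substring check,
# which is all the string form of fl-contains can express.
substring_hit = content =~ "DeF"         # true
substring_miss = content =~ "def"        # false, the case differs

# A regex on the right gives the matches the string form cannot.
case_insensitive = content =~ ~r/def/i   # true
begins_with = "abc123" =~ ~r/^abc/       # true
ends_with = "abc123" =~ ~r/123$/         # true

IO.inspect({substring_hit, substring_miss, case_insensitive, begins_with, ends_with})
# prints {true, false, true, true, true}
```

So if a regex can be put into value, begins-with, ends-with and case-insensitive matching all come for free.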
Floki.Selector has this structure
defstruct id: nil,
          type: nil,
          classes: [],
          attributes: [],
          namespace: nil,
          pseudo_classes: [],
          combinator: nil
pseudo_classes is the field used to pass our fl-contains pseudo selector, and each entry in it is matched by this code
defp pseudo_class_match?(html_node, pseudo_class, tree) do
  case pseudo_class.name do
    ...
    "fl-contains" ->
      PseudoClass.match_contains?(tree, html_node, pseudo_class)
  end
end
Now that we know how Floki builds Floki.Selector structs and matches against them, it’s easy to craft a struct that passes a regex to the filter engine.
html = """
<div>abc123</div>
<div><span>./---DeFfff/</span></div>
"""
selector = %Floki.Selector{
  pseudo_classes: [
    %Floki.Selector.PseudoClass{
      name: "fl-contains",
      value: ~r/abc|def/i
    }
  ]
}
Floki.find(Floki.parse(html), selector)
# => [{"div", [], ["abc123"]}, {"span", [], ["./---DeFfff/"]}]