Floki is a great Elixir library used to parse HTML and query the result using CSS selectors.

CSS selectors have one big limitation: you can’t query by a node’s text content, because the standard still has no support for it.

To overcome this limitation, Floki added support for the fl-contains pseudo-class, modeled on the contains selector best known as the jQuery way of searching the DOM by text.

The syntax for the Floki filter is

Floki.find(html, "p:fl-contains('text')")

The main problem is that it only does a strict, case-sensitive match between the node’s text and the filter. In our example, it will only match nodes containing the string ‘text’.

You can’t match on begins-with or ends-with, and you can’t even do a case-insensitive search.

After studying how Floki’s pseudo-classes are implemented, I realized I could use regular expressions as filters and gain much more flexibility.

As you can see in the implementation, text content is matched using the =~ operator, which means you can pass a regex on the right-hand side.

# contains match implementation
%Text{content: content} -> content =~ value
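To make the difference concrete, here is a small sketch that uses only the Elixir standard library (no Floki involved) to show how =~ behaves with a string versus a regex on the right-hand side. The sample text is made up for illustration.

```elixir
text = "One Two Three"

# String on the right: a plain, case-sensitive substring check
true = text =~ "Two"
false = text =~ "two"

# Regex on the right: case-insensitive, begins-with and ends-with all work
true = text =~ ~r/two/i
true = text =~ ~r/^One/
true = text =~ ~r/Three$/
false = text =~ ~r/^Two/
```

Each line is a pattern match against the expected boolean, so the script raises a MatchError if any expectation is wrong.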

value is a field inside the Floki.Selector.PseudoClass struct, which is in turn stored inside the Floki.Selector struct. Floki.Selector is what you pass to Floki.find, and it normally gets built from the CSS selector string.

Floki.Selector has this structure

defstruct id: nil,
          type: nil,
          classes: [],
          attributes: [],
          namespace: nil,
          pseudo_class: nil,
          combinator: nil

pseudo_class is the field that carries our fl-contains pseudo-selector, which is matched by this code:

defp pseudo_class_match?(html_node, pseudo_class, tree) do
  case pseudo_class.name do
    ...

    "fl-contains" ->
      PseudoClass.match_contains?(tree, html_node, pseudo_class)
  end
end

Now that we know how Floki builds Floki.Selector structs and matches against them, it’s easy to craft a struct to pass a regex to the filter engine.


html = """
<div>abc123</div>
<div><span>./---DeFfff/</span></div>
"""

# pseudo_class is the field name from the Floki.Selector defstruct shown earlier
selector = %Floki.Selector{
  pseudo_class: %Floki.Selector.PseudoClass{
    name: "fl-contains",
    value: ~r/abc|def/i
  }
}

Floki.find(Floki.parse(html), selector)
# => [{"div", [], ["abc123"]}, {"span", [], ["./---DeFfff/"]}]
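
The same idea covers the begins-with and ends-with cases mentioned earlier. The sketch below is not Floki's implementation: RegexContainsSketch is a hypothetical, standard-library-only module that walks a hand-written tree of {tag, attributes, children} tuples (the shape seen in the result above) and keeps every node that has a direct text child matching the value with =~. It is only meant to illustrate what different regex values would select.

```elixir
defmodule RegexContainsSketch do
  # Hypothetical helper, not part of Floki: returns every node that has a
  # direct text child for which `text =~ value` is true.
  def find(nodes, value) when is_list(nodes) do
    Enum.flat_map(nodes, &find(&1, value))
  end

  def find({_tag, _attrs, children} = node, value) do
    own =
      if Enum.any?(children, &(is_binary(&1) and &1 =~ value)), do: [node], else: []

    # Keep searching below this node, whether or not it matched itself
    own ++ find(children, value)
  end

  # Text nodes (binaries) and anything else have nothing to search
  def find(_other, _value), do: []
end

# Hand-written tree mirroring the HTML used in the example
tree = [
  {"div", [], ["abc123"]},
  {"div", [], [{"span", [], ["./---DeFfff/"]}]}
]

# begins-with
[{"div", [], ["abc123"]}] = RegexContainsSketch.find(tree, ~r/^abc/)

# ends-with
[{"span", [], ["./---DeFfff/"]}] = RegexContainsSketch.find(tree, ~r{/$})

# case-insensitive alternation, same regex as the Floki example
[{"div", [], ["abc123"]}, {"span", [], ["./---DeFfff/"]}] =
  RegexContainsSketch.find(tree, ~r/abc|def/i)
```

With Floki itself, the only thing that should need to change is the regex placed in the value field of the PseudoClass struct.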