web scraping

Programming Ideas by Inhahe · just now · edited

Like ElementTree, XPath, lxml, etc., but better.

example of the type of problem that could be made easier:

"
Say I have the following HTML (I hope this shows up as plain text here rather than formatting):

<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is today the day?"</strong></em></span></div>

And I want to extract the "Is today the day?" part. There are other places in the document with <em> and <strong>, but this is the only place that uses color #000000, so I want to extract anything that's within a color #000000 style, even if it's nested multiple levels deep within that.

  • Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined as #000000
  • Sometimes the <strong> is within the <em> and sometimes the <em> is within the <strong>.
  • There may be other discrepancies I haven't noticed yet

How can I do this in BeautifulSoup (or is this better done in lxml.html)?
"

so, for example, we could have an expression that matches nodes between X and Y levels under the current node (along with any other possible restrictions, like class name, etc.)
XPath has // which will find something any number of levels deep under the current node, but as far as i know it has no concise way to limit that to a range of depths.
we could also have an equivalent to 'or' in matching expressions so we could do <strong> or <em> or whatever.
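
to make the depth-range and 'or' ideas concrete, here is a rough sketch in plain Python on top of the standard library's ElementTree. the function name descendants_between and its parameters are made up for illustration; they are not part of any existing library, and a real matching language would need its own syntax rather than a function call.

```python
import xml.etree.ElementTree as ET

def descendants_between(node, min_depth, max_depth, tags=None):
    """Yield descendants of `node` whose depth below it is in [min_depth, max_depth].

    Depth 1 means a direct child. `tags` is an optional set of tag names
    that acts as an 'or': an element matches if its tag is any of them.
    (Hypothetical helper, not an ElementTree API.)
    """
    def walk(elem, depth):
        if depth > max_depth:
            return  # nothing deeper can match, so stop descending
        for child in elem:
            if depth >= min_depth and (tags is None or child.tag in tags):
                yield child
            yield from walk(child, depth + 1)

    yield from walk(node, 1)

doc = ET.fromstring(
    "<div><span><em><strong>deep</strong></em></span>"
    "<strong>shallow</strong></div>"
)

# <strong> or <em>, but only 2 to 3 levels under the <div>
matches = list(descendants_between(doc, 2, 3, tags={"strong", "em"}))
print([m.tag for m in matches])
```

the shallow <strong> is a direct child (depth 1), so it falls outside the range and is skipped, while the nested <em> and <strong> are found.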

there's a lot of things that could be done. there should be a whole language for it akin to regular expressions (but substantially different).

other examples of things we could express:
- that somewhere in the hierarchy between node x and sub-node y there needs to be a <strong>, at any level
- that somewhere in the hierarchy between node x and sub-node y there needs to be a <strong>, at any level AND an <em> at any level
- specifically for html, ways to universalize things like, e.g., color so that we don't have to check for e.g. "color: #000000", "color:#000000", "RGB(0, 0, 0)", "RGB(0,0,0)", etc. etc.
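
the checks in this list can be approximated by hand today, which shows roughly what such a language would have to express. the sketch below uses only the standard library; every function name in it (normalize_color, style_color, text_under_color, path_between, has_tags_between) is hypothetical. it assumes well-formed markup, inline style attributes only, and just the '#rgb', '#rrggbb', and 'rgb(r, g, b)' color forms. splitting the style on ';' is naive and would break on values that contain a semicolon.

```python
import re
import xml.etree.ElementTree as ET

def normalize_color(value):
    """Return a color as lowercase '#rrggbb', or None if the form is not recognized."""
    value = value.strip().lower()
    if re.fullmatch(r"#[0-9a-f]{6}", value):
        return value
    if re.fullmatch(r"#[0-9a-f]{3}", value):
        return "#" + "".join(c * 2 for c in value[1:])
    m = re.fullmatch(r"rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)", value)
    if m:
        channels = [int(g) for g in m.groups()]
        if all(c <= 255 for c in channels):
            return "#" + "".join(f"{c:02x}" for c in channels)
    return None

def style_color(elem):
    """Normalized 'color' property from an element's inline style, or None."""
    for decl in elem.get("style", "").split(";"):
        name, sep, value = decl.partition(":")
        if sep and name.strip().lower() == "color":
            return normalize_color(value)
    return None

def text_under_color(root, color):
    """Text of every element styled with `color`, however deeply the text is nested."""
    wanted = normalize_color(color)
    if wanted is None:
        return []  # avoid matching elements that simply have no color
    return ["".join(e.itertext()) for e in root.iter() if style_color(e) == wanted]

def path_between(x, y):
    """Elements strictly between ancestor x and descendant y, or None if y is not under x."""
    for child in x:
        if child is y:
            return []
        rest = path_between(child, y)
        if rest is not None:
            return [child] + rest
    return None

def has_tags_between(x, y, required):
    """True if every tag in `required` appears somewhere strictly between x and y."""
    path = path_between(x, y)
    return path is not None and set(required) <= {e.tag for e in path}

# same text as the question, but with the RGB form and <strong> outside <em>
html = (
    '<div style="font-size: 20pt;"><span style="color: RGB(0, 0, 0);">'
    '<strong><em>"Is today the day?"</em></strong></span></div>'
)
root = ET.fromstring(html)
em = root.find(".//em")

print(text_under_color(root, "#000000"))
print(has_tags_between(root, em, {"strong"}))
```

note that text_under_color returns the text once per matching element, so nested elements that both set the same color would produce overlapping results. a proper matching language would presumably let you say which of those you want.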

(i wrote the above before CSS was a thing)