Sqrape - Simple Query Scraping with CSS and Go Reflection

by Cathal Garvey, ©2016, Released under the GNU AGPLv3


What

When scraping web content, one usually hopes that the content is laid out logically and that proper, or at least consistent, web annotation exists. This means well-nested HTML, appropriate use of tags, descriptive CSS classes and unique CSS IDs. Ideally, it also means that a given CSS selector will yield a consistent datatype.

In such cases, it's possible to define exactly what you want using only CSS and a type. For a scraping job, then, it would be ideal to just define a struct describing the content you want and scrape a page directly into it, right?

So, something like this:

package main

import (
	"fmt"
	"net/http"

	"github.com/cathalgarvey/sqrape"
)

type Tweet struct {
	Author  string `csss:"div.original-tweet;attr=data-screen-name"`
	TweetID int64  `csss:"div.original-tweet;attr=data-tweet-id"`
	Content string `csss:"p.js-tweet-text;text"`
}

type TwitterProfile struct {
	Tweets []Tweet `csss:"li.js-stream-item;obj"`
}

func main() {
	// Error handling elided for brevity.
	resp, _ := http.Get("https://twitter.com/onetruecathal")
	defer resp.Body.Close()
	tp := new(TwitterProfile)
	sqrape.ExtractHTMLReader(resp.Body, tp)
	for _, tweet := range tp.Tweets {
		fmt.Printf("@%s: %s\n", tweet.Author, tweet.Content)
	}
}

...well, that's Sqrape. In fact, see examples/tweetgrab.go for the above as a CLI tool.

Note: the struct tag is csss, not css. It stands for "CSS selector"; I didn't want to clobber any preexisting css struct tag libraries.

How?

Basics

Sqrape uses struct tags to figure out how to access and extract data. These tags consist of two portions separated by a semicolon: a CSS selector and a data extractor. Writing CSS selectors is left as an exercise for the reader; they're well documented elsewhere, and are passed to goquery under the hood, so consult the goquery docs if in doubt.

One difference from goquery: empty selectors are OK, and indicate "extract data from the entire selection". These are most commonly useful for embedded structs or slices, where the data handed down may already be ready for extraction and require no further CSS searching.
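As an illustrative sketch (assuming an empty selector is written simply as nothing before the semicolon; the type, field, and selector names here are invented), an embedded struct that needs no further narrowing might look like this:

type Tag struct {
	// Empty selector: the element already matched by the parent tag
	// is used directly, with no further CSS searching.
	Name string `csss:";text"`
}

type Post struct {
	// Each a.tag element is handed down whole to a Tag value.
	Tags []Tag `csss:"a.tag;obj"`
}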

The second portion simply indicates what part or form of the selected data is desired, and can take four forms, three of which are trivial:

  • text: The text content of the matched selection is returned.
  • html: The HTML content of the matched selection is returned.
  • attr=<attribute name>: The value of the named attribute on the matched selection is extracted.
  • obj: Indicates a struct or slice field whose contents are parsed recursively.

Therefore, to extract the data-foo value from a div, use csss:"div[data-foo];attr=data-foo": this selects any div with a data-foo attribute, and extracts the value of that attribute.

To extract values other than strings, simply set the field type in your struct to the desired type; this magic is handled by mapstructure! So, if data-foo is a number, then the field the above tag annotates can be an int or int64.
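For example, a sketch along these lines (the struct and the span.label/div.body selectors are invented for illustration; only the tag syntax is Sqrape's) would pull data-foo out as a number alongside plain text and raw HTML:

type FooBox struct {
	// attr= extracts the attribute's value; mapstructure coerces e.g. "42" into an int64.
	FooValue int64  `csss:"div[data-foo];attr=data-foo"`
	// text extracts the text content of the matched elements.
	Label    string `csss:"div[data-foo] span.label;text"`
	// html extracts the inner HTML of the matched elements.
	Body     string `csss:"div[data-foo] div.body;html"`
}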

If your field is a singleton, then the first value will be extracted in the case of attributes, and the concatenation of all values in the case of text or HTML. If your field is a slice, then the values will be added iteratively from the goquery selection.
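To illustrate the difference (field names invented), the same selector fills a singleton and a slice differently:

type PageLinks struct {
	// Singleton: only the first matching href is used.
	FirstHref string   `csss:"a;attr=href"`
	// Slice: every matching href is appended, one per matched element.
	AllHrefs  []string `csss:"a;attr=href"`
}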

If your field is a struct or slice of structs, then the extractor portion of the tag should be obj, to indicate that parsing data from extracted structs should be deferred to the embedded struct fields. See the Twitter example, above.

More Advanced: Optional Methods

Sometimes a datatype needs to be filled from multiple sources, has fields that should only be filled under certain conditions, or otherwise needs conditional, context-aware behaviour. For this, you can define optional methods that alter Sqrape's behaviour, letting you selectively fill fields or perform post-processing and post-scrape data validation on your struct.

The methods supported so far include:

  • SqrapeFieldSelect(fieldName string, context ...interface{}) (doField bool, cancelScrape error)
  • SqrapePostFlight(context ...interface{}) error

The context argument in either case is a variadic list of arbitrary values that you pass to the entrypoint functions when running a scrape.

So, for example, you could implement multi-page scraping by passing the current URL to your scrape and defining a SqrapeFieldSelect method that fills fields only for relevant URLs.
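A minimal sketch of that idea, assuming the caller passes the current page URL as the context argument (the Article type, its selectors, and the URL check are invented for illustration):

import "strings"

type Article struct {
	Title    string   `csss:"h1.entry-title;text"`
	Comments []string `csss:"div.comment p;text"`
}

// SqrapeFieldSelect is consulted for each field; returning (false, nil)
// skips that field, while a non-nil error cancels the whole scrape.
func (a *Article) SqrapeFieldSelect(fieldName string, context ...interface{}) (bool, error) {
	if fieldName != "Comments" || len(context) == 0 {
		return true, nil
	}
	pageURL, _ := context[0].(string)
	// Only fill Comments when we are actually on a comments page.
	return strings.Contains(pageURL, "/comments"), nil
}

The URL itself is supplied as context at scrape time, along the lines of sqrape.ExtractHTMLReader(resp.Body, article, pageURL).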

Or, you could perform data validation on your expected output with a SqrapePostFlight method, either with hardcoded regexes/validation or by passing per-job regexes or callbacks as context. Any error you return from SqrapePostFlight is returned from the scrape to you.
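For instance, a hedged sketch of post-scrape validation on the Tweet struct from the earlier example (the regex and error messages are mine, not the library's):

import (
	"fmt"
	"regexp"
)

var handlePattern = regexp.MustCompile(`^[A-Za-z0-9_]{1,15}$`)

// SqrapePostFlight runs once the struct has been filled; any error
// returned here is handed back to the caller as the scrape's result.
func (t *Tweet) SqrapePostFlight(context ...interface{}) error {
	if !handlePattern.MatchString(t.Author) {
		return fmt.Errorf("author %q does not look like a Twitter handle", t.Author)
	}
	if t.Content == "" {
		return fmt.Errorf("tweet %d has empty content", t.TweetID)
	}
	return nil
}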

What's Supported?

Nested structs and array fields containing either basic values or struct values. This means that, aside from map-fields, most stuff should just work. File an Issue for cases that fail and I'll try to get it working.

Take a look at the test cases for an example of what works well. Feel free to volunteer test cases.

What's Not Supported?

Pointer fields! If a field is a pointer to a nested struct, Sqrape will currently crash, and for reasons unknown to me you'll get no informative error while panic-catching is enabled in the entrypoint functions. I'm working on a fix that will initially just abort with an informative error on pointer fields, and will later support them properly.

Credits Reel

Obviously, GoQuery deserves a huge slice of the credit for this.

A lot of the magic behind field-filling is thanks to mapstructure, which handles "weakly typed" field-filling for structs.

There's a lot of reflective magic in this code; right now that's predictably messy and due a rewrite in pure reflect code. Meanwhile, thanks to structs and reflections for tiding me over this far into the project by offering handy higher-level abstractions over reflect.

Reflection may give you the shivers; you're right, this code is potentially explosive right now! Caveat emptor. However, the entry point functions do have a blanket-recover deferred, so this code shouldn't panic, merely return an error on panicky behaviour. Please report any panic errors you encounter, to help me make this more stable.

Why?

I scrape content a lot. Weekly, sometimes daily, as part of my job or for personal research. Web scraping is just another way of consuming web content! I do most of my scraping in the IPython shell, but for something "important" I'll write something more permanent and use that whenever the need arises.

For this, one typically uses a scraping framework. But, permanence has disadvantages. If your scraping framework requires a lot of overhead for very basic tasks, then that means the maintenance burden when things change is also high.

I wanted something where creating and maintaining a scraper could be trivial, a matter of just defining the data I want and mapping it to the HTML. If or when the HTML changes, then I only need to change the datatypes or CSS rules and get back to using the data.
