Grouper: Convert web pages and news searches to RSS
Blog Plugin for Grouper Evolution
The extendable version of Grouper
 
Blog is a plugin bundled with Grouper Evolution. The default configuration works with blogs whose templates contain the comment tags described below, which mark the different parts of the page. By changing the configuration settings, it can extract information from many other pages as well.
 
Installation:
To install Blog, put blog.php into the "plugins" folder inside the folder containing grouper.php.
 
Use:
To tell Grouper Evolution to use Blog, enter commands like the following into your webpage after "require_once '/path/to/grouper.php';" (changing "antone.geckotribe.com" to the desired domain name and "/mentalarc/" to the path to the blog):
 
     GrouperLoadPlugin('blog.php');
     GrouperSourceConf('searchdomain','antone.geckotribe.com');
     GrouperSourceConf('querystart','/mentalarc/');
 
Configuration:
Blog reads the page at the configured domain and path (for example, http://antone.geckotribe.com/mentalarc/) and, by default, uses the comment tags described below to break it down into individual items. You may change this behavior by overriding the default values with the function GrouperSourceConf, as follows:
     GrouperSourceConf('OptionNameFromBelow','new value');
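
As a concrete illustration, the following sketch overrides a few of the options described below. Only the option names and function names come from this page; the values are made up for the example, and passing the numeric maxidesc limit as a string is an assumption based on the call form shown above.

```php
<?php
// Hypothetical values; only the option and function names come from this page.
require_once '/path/to/grouper.php';

GrouperLoadPlugin('blog.php');
GrouperSourceConf('searchdomain', 'antone.geckotribe.com');
GrouperSourceConf('querystart', '/mentalarc/');

// Fallback channel text, used only if none is extracted from the page.
GrouperSourceConf('channeltitle', 'Example Blog');
GrouperSourceConf('channeldescription', 'Recent posts from an example blog');

// Cut item descriptions at 300 characters and mark the cut with "...".
// Assumption: the limit is accepted as a string, like the other values.
GrouperSourceConf('maxidesc', '300');
GrouperSourceConf('atruncidesc', '...');
```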
 
Blog has the following options:
  • searchdomain: The domain name of the blog or other page you wish to scrape (for example, 'antone.geckotribe.com').
  • querystart: The path to the page you wish to scrape. This value MUST begin with '/'. If the path is to a directory and the document contains relative links, it must end with '/' for the links to be processed correctly. Note that this applies only to link fields, not to links in the description text (which are not altered by this plugin).
  • maxidesc: The maximum number of characters to include in the item description. Any additional characters will be discarded.
  • atruncidesc: The text to add after an item description that has been truncated by the maxidesc setting.
  • encoding: The character encoding of the page (and thus of the newsfeed). You can usually leave this as it is.
  • channeltitle: The default title for your RSS channel. If a title is successfully extracted from the page, it will override this value.
  • channeldescription: The default description for your RSS channel. If a description is successfully extracted from the page, it will override this value.
  • cfields: The channel fields (fields that apply to the overall page, as opposed to the individual items) to look for in the page and include, if found, in the RSS feed. Supported values are title and description.
  • ifields: The item fields (fields for each individual item) to look for in the page and include, if found, in the RSS feed. Supported values are: title, description, datetime, date, time, author and link. NOTE: Use datetime if the page contains timestamps including both the date and time. If the date and time are listed in separate locations, use date and time instead. Blog will combine them into a single pubDate field in the RSS feed.
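
The trailing-slash rule for querystart follows from how relative links are normally resolved against a base path. The sketch below is not the plugin's code. It is a minimal model of ordinary relative-URL resolution, assuming an http:// scheme and ignoring "../" segments, with a hypothetical function name (resolveLinkSketch) and a hypothetical link (post1.html).

```php
<?php
// Illustrative only: this is not the plugin's code. It applies ordinary
// relative-URL resolution to show why a directory path needs a trailing '/'.

function resolveLinkSketch(string $domain, string $querystart, string $link): string
{
    if ($querystart === '' || $querystart[0] !== '/') {
        throw new InvalidArgumentException("querystart must begin with '/'");
    }
    // Links that already carry a scheme are returned unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    // Root-relative links replace the whole path.
    if ($link !== '' && $link[0] === '/') {
        return 'http://' . $domain . $link;
    }
    // Otherwise keep the base path up to and including its last '/'.
    $baseDir = substr($querystart, 0, strrpos($querystart, '/') + 1);
    return 'http://' . $domain . $baseDir . $link;
}

// With the trailing slash, the link stays inside the blog directory.
echo resolveLinkSketch('antone.geckotribe.com', '/mentalarc/', 'post1.html'), "\n";
// Without it, "mentalarc" is treated as a file name and dropped.
echo resolveLinkSketch('antone.geckotribe.com', '/mentalarc', 'post1.html'), "\n";
```

Under these assumptions the first call yields http://antone.geckotribe.com/mentalarc/post1.html and the second yields http://antone.geckotribe.com/post1.html, which is why a directory path without the final '/' produces broken item links.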
The rest of the options tell Blog what to look for on the page to locate the different parts of the newsfeed. By default, these are comment tags which must be added to the blog template. If you wish to scrape a blog that does not include these tags, you will need to study the HTML source for the blog and find other tags or text that can be used to locate each item. For example, take a look at the source of The Electric Eel's Mental Arc. The default value for each is shown immediately after the option name:
  • tossbefore: <!-- GrouperStart --> If this value is not blank, anything appearing before it on the page will be discarded.
  • tossafter: <!-- GrouperEnd --> If this value is not blank, anything appearing after it on the page will be discarded.
  • channelstart & channelend: <!-- Heading --> & <!-- /Heading --> Everything appearing between these values will be searched for the channel information.
  • itemsstart & itemsend: <!-- Blog Posts --> & <!-- /Blog Posts --> Everything appearing between these values will be searched for ALL of the individual items.
  • itemstart & itemend: <!-- Item --> & <!-- /Item --> Everything appearing between these values will be searched for the fields in EACH individual item.
  • ctitlestart & ctitleend: <!-- Title --> & <!-- /Title --> Everything appearing between these values (within the channelstart/channelend section) will be used as the channel title.
  • cdescriptionstart & cdescriptionend: <!-- Description --> & <!-- /Description --> Everything appearing between these values (within the channelstart/channelend section) will be used as the channel description.
  • ititlestart & ititleend: <!-- Title --> & <!-- /Title --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's title.
  • idescriptionstart & idescriptionend: <!-- Description --> & <!-- /Description --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's description.
  • iauthorstart & iauthorend: <!-- Author --> & <!-- /Author --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's author.
  • ilinkstart & ilinkend: <!-- Link --> & <!-- /Link --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's link.
  • idatetimestart & idatetimeend: <!-- DateTime --> & <!-- /DateTime --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's pubDate. If DateTime, Date and Time are all found, DateTime will override Date and Time.
  • idatestart & idateend: <!-- Date --> & <!-- /Date --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's date, which will be combined with the time to generate the item's pubDate. If DateTime is found, it will override Date and Time.
  • itimestart & itimeend: <!-- Time --> & <!-- /Time --> Everything appearing between these values (within an individual itemstart/itemend section) will be used as that item's time, which will be combined with the date to generate the item's pubDate. If DateTime is found, it will override Date and Time.
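
To make the role of the start and end markers more concrete, here is a simplified model of marker-based extraction. It is a sketch under stated assumptions, not the plugin's actual implementation: the helper names (between, allBetween) and the sample page are hypothetical, and the date is joined as plain text rather than formatted as a real RSS pubDate.

```php
<?php
// Illustrative only: a simplified model of marker-based extraction,
// not the plugin's actual implementation.

// Return the trimmed text between the first $start and the next $end, or null.
function between(string $text, string $start, string $end): ?string
{
    $from = strpos($text, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    if ($to === false) {
        return null;
    }
    return trim(substr($text, $from, $to - $from));
}

// Return every section delimited by $start ... $end, in page order.
function allBetween(string $text, string $start, string $end): array
{
    $sections = [];
    $offset = 0;
    while (($from = strpos($text, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($text, $end, $from);
        if ($to === false) {
            break;
        }
        $sections[] = substr($text, $from, $to - $from);
        $offset = $to + strlen($end);
    }
    return $sections;
}

// Hypothetical page fragment using the default comment tags.
$page = <<<HTML
<!-- Blog Posts -->
<!-- Item --><h2><!-- Title -->First post<!-- /Title --></h2>
<!-- Date -->1 Jan 2005<!-- /Date --> <!-- Time -->10:00<!-- /Time --><!-- /Item -->
<!-- Item --><h2><!-- Title -->Second post<!-- /Title --></h2>
<!-- DateTime -->2 Jan 2005 11:30<!-- /DateTime --><!-- /Item -->
<!-- /Blog Posts -->
HTML;

$items = [];
$block = between($page, '<!-- Blog Posts -->', '<!-- /Blog Posts -->') ?? '';
foreach (allBetween($block, '<!-- Item -->', '<!-- /Item -->') as $raw) {
    // DateTime wins when present; otherwise Date and Time are joined.
    $when = between($raw, '<!-- DateTime -->', '<!-- /DateTime -->');
    if ($when === null) {
        $date = between($raw, '<!-- Date -->', '<!-- /Date -->');
        $time = between($raw, '<!-- Time -->', '<!-- /Time -->');
        $when = trim(($date ?? '') . ' ' . ($time ?? ''));
    }
    $items[] = [
        'title' => between($raw, '<!-- Title -->', '<!-- /Title -->'),
        'pubDate' => $when,
    ];
}

print_r($items);
```

To scrape a page that lacks these comments, the same options can point at markup the page already contains, for example GrouperSourceConf('itemstart','<div class="post">') paired with GrouperSourceConf('itemend','<!-- end post -->'). Both marker strings here are hypothetical and would need to be replaced with text found in the target page's HTML source.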