Extracting meta-data from pages

From MozillaWiki

Goal

Pancake currently looks only at page URLs and titles. Investigate what we can find out about pages by looking at their content. Examples are:

  • Page structure: headings, article text, etc.
  • Meta tags: icons, authors, etc.
  • Embedded microformats like recipes, contacts, geo-information, etc.
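
To make the three kinds of information above concrete, here is a minimal sketch of a single-pass extractor built on Python's standard `html.parser`. It is not Pancake's implementation. The class name `PageMetadataParser`, the function `extract_metadata`, and the short list of microformat root classes (`vcard`, `hrecipe`, `geo`, `vevent`) are assumptions chosen for illustration, and article-text extraction is left out because it needs heuristics beyond a simple tag walk.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collects headings, <meta> pairs, icon links, and microformat hints."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    # Assumed subset of microformat root class names; extend as needed.
    MICROFORMAT_CLASSES = {"vcard", "hrecipe", "geo", "vevent"}

    def __init__(self):
        super().__init__()
        self.headings = []          # list of (tag, text) in document order
        self.meta = {}              # lowercased name/property -> content
        self.icons = []             # href values of <link rel="... icon ...">
        self.microformats = set()   # microformat root classes seen
        self._open_heading = None   # tag name of the heading being read
        self._buffer = []           # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADING_TAGS:
            self._open_heading = tag
            self._buffer = []
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "icon" in rel and attrs.get("href"):
                self.icons.append(attrs["href"])
        # Any element may carry a microformat root class.
        classes = set((attrs.get("class") or "").split())
        self.microformats |= classes & self.MICROFORMAT_CLASSES

    def handle_data(self, data):
        if self._open_heading is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._buffer).split())
            self.headings.append((tag, text))
            self._open_heading = None


def extract_metadata(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "meta": parser.meta,
        "icons": parser.icons,
        "microformats": sorted(parser.microformats),
    }
```

Calling `extract_metadata(page_source)` on a fetched page returns one dictionary per page, which is a convenient unit for later counting how often each field is actually present.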

We should find out how easy it is to find and extract this information, and whether enough pages carry useful information for us to do something with it.
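
One rough way to answer the "enough pages" question is to run cheap detectors over a sample of page sources and report the fraction of pages each one matches. The sketch below is only an assumption about how such a survey could be done: the `DETECTORS` table and the `coverage` function are hypothetical names, and the regular expressions are deliberately crude (they can miss unusual attribute orders or quoting), so the numbers they produce are a first estimate rather than a measurement.

```python
import re

# Hypothetical, deliberately crude detectors: feature name -> pattern.
DETECTORS = {
    "author_meta": re.compile(r'<meta[^>]+name=["\']author["\']', re.I),
    "icon_link": re.compile(r'<link[^>]+rel=["\'][^"\']*icon', re.I),
    "hcard": re.compile(r'class=["\'][^"\']*\bvcard\b', re.I),
    "hrecipe": re.compile(r'class=["\'][^"\']*\bhrecipe\b', re.I),
    "geo": re.compile(r'class=["\'][^"\']*\bgeo\b', re.I),
}


def coverage(pages):
    """Return, per feature, the fraction of pages in which it was detected.

    `pages` is a list of HTML source strings.
    """
    counts = {name: 0 for name in DETECTORS}
    for html in pages:
        for name, pattern in DETECTORS.items():
            if pattern.search(html):
                counts[name] += 1
    total = len(pages)
    return {
        name: (count / total if total else 0.0)
        for name, count in counts.items()
    }
```

Running `coverage` over a sample of pages that users actually visit would show which features are common enough to be worth supporting.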

We should also work out how to use the extracted information, both for generic results and for very domain-specific results such as "people", "recipes", and "locations".