Extracting meta-data from pages

From MozillaWiki
Revision as of 14:42, 19 June 2012 by St3fan

Goal

Pancake currently looks only at page URLs and titles. Investigate what we can find out about pages by looking at their content. Examples include:

  • Page structure: headings, article text, etc.
  • Meta tags: icons, authors, etc.
  • Embedded microformats such as recipes, contacts, geo-information, etc.
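
As a rough illustration of the first two items, the sketch below pulls headings, meta tags, and icon links out of a page using only Python's standard library html.parser module. It is not Pancake code: the names PageMetadataParser and extract_metadata are hypothetical, the set of fields collected is an assumption, and microformat extraction is left out entirely.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collects headings, <meta> tags and icon links from one HTML page.

    Hypothetical helper for illustration; not part of Pancake.
    """

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []          # (tag, text) tuples in document order
        self.meta = {}              # meta name/property -> content
        self.icons = []             # href values of <link rel="... icon ...">
        self._open_heading = None   # heading tag currently being read, if any
        self._text_parts = []       # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.HEADING_TAGS:
            self._open_heading = tag
            self._text_parts = []
        elif tag == "meta":
            # Classic meta tags use "name"; Open Graph style tags use "property".
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content
        elif tag == "link":
            rel_tokens = (attributes.get("rel") or "").lower().split()
            href = attributes.get("href")
            if "icon" in rel_tokens and href:
                self.icons.append(href)

    def handle_data(self, data):
        # Only keep text that appears inside a heading element.
        if self._open_heading is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            # Collapse runs of whitespace so nested markup does not leave gaps.
            text = " ".join("".join(self._text_parts).split())
            self.headings.append((tag, text))
            self._open_heading = None


def extract_metadata(html):
    """Return the headings, meta tags and icons found in an HTML string."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "meta": parser.meta,
        "icons": parser.icons,
    }
```

Calling extract_metadata(page_source) on a fetched page returns a dictionary with the three fields. Real pages are often malformed, so a production version would probably need a more forgiving parser and some handling for character encodings.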

We should find out how easy it is to find and extract this information, and whether enough pages contain useful information for us to do something with it.
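
One way to put a number on "enough pages" is to measure, for a sample of pages, what fraction yields a non-empty value for each kind of information. The sketch below assumes each page's extraction result is a dictionary mapping a field name to whatever was found; the function name field_coverage and the field names are hypothetical, and the threshold that counts as "enough" is not defined here.

```python
def field_coverage(pages):
    """Return, per field, the fraction of pages with a non-empty value.

    `pages` is a list of dictionaries, one per page (assumed shape).
    Fields missing from a page count as empty for that page.
    """
    total = len(pages)
    if total == 0:
        return {}
    # Every field seen on any page, so never-populated fields report 0.0.
    fields = sorted({field for page in pages for field in page})
    return {
        field: sum(1 for page in pages if page.get(field)) / total
        for field in fields
    }
```

Running this over a sample of real browsing history would show which kinds of information are common enough to build on and which appear too rarely to be worth the effort.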

We should also work out how to use the extracted information, both for generic results and perhaps for very domain-specific results such as "people", "recipes", and "locations".