Extracting meta-data from pages
Goal
Pancake currently only looks at page URLs and titles. Investigate what we can find out about pages by looking at their content. Examples are:
- Page structure: headings, article text, etc.
- Meta tags: icons, authors, etc.
- Embedded microformats like recipes, contacts, geo-information, etc.
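As a rough illustration of the three kinds of information listed above, the following Python sketch pulls headings, meta tags, icon links, and a few microformat root classes out of an HTML string using only the standard library. The names `PageMetadataParser`, `extract_metadata`, and `MICROFORMAT_CLASSES` are made up for this example and are not part of Pancake; the class list is a small illustrative subset, not a complete microformats vocabulary.

```python
from html.parser import HTMLParser

# Root class names of a few common microformats (illustrative subset only).
MICROFORMAT_CLASSES = {"vcard", "h-card", "hrecipe", "h-recipe", "geo", "h-geo"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageMetadataParser(HTMLParser):
    """Collects headings, meta tags, icon links and microformat root classes."""

    def __init__(self):
        super().__init__()
        self.headings = []          # list of (tag, text) in document order
        self.meta = {}              # meta name/property -> content
        self.icons = []             # href values of <link rel="... icon ...">
        self.microformats = set()   # microformat root classes seen on any tag
        self._heading_tag = None    # heading currently being read, if any
        self._heading_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADING_TAGS:
            self._heading_tag = tag
            self._heading_text = []
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "icon" in rel and attrs.get("href"):
                self.icons.append(attrs["href"])
        # Any element may carry a microformat class, so check every tag.
        classes = set((attrs.get("class") or "").split())
        self.microformats |= classes & MICROFORMAT_CLASSES

    def handle_data(self, data):
        if self._heading_tag:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag == self._heading_tag:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._heading_text).split())
            self.headings.append((tag, text))
            self._heading_tag = None


def extract_metadata(html):
    """Return a dict summarising what was found in one HTML document."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "meta": parser.meta,
        "icons": parser.icons,
        "microformats": sorted(parser.microformats),
    }
```

This only detects that a microformat is present; parsing the properties inside it (ingredients, names, coordinates) would need a dedicated microformats parser, and extracting article text would need a separate readability-style heuristic.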
We should find out how easy it is to find and extract this information, and whether enough pages carry useful information for us to do something with it.
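One simple way to answer the "enough pages" question is to run an extractor over a sample of pages and report, per feature, the fraction of pages on which it was found. The sketch below assumes each page has already been reduced to a set of feature names; the function name `feature_coverage` and the feature labels in the usage example are hypothetical.

```python
from collections import Counter


def feature_coverage(pages):
    """Return {feature: fraction of pages that have it}.

    `pages` is an iterable with one collection of feature names per page,
    for example {"author", "icon"} for a page with both an author meta tag
    and an icon link.
    """
    counts = Counter()
    total = 0
    for features in pages:
        total += 1
        counts.update(set(features))  # count each feature once per page
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Usage with a made-up sample of four pages:
sample = [{"author", "icon"}, {"icon"}, set(), {"icon", "recipe"}]
coverage = feature_coverage(sample)
```

What counts as a "big enough" fraction is a product decision that this sketch does not try to answer.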
We should also work out how to use the extracted information, both for generic results and for very domain-specific results such as "people", "recipes", and "locations".