Extracting Useful Text From HTML
Ever wondered how search engines produce text summaries from HTML pages? Here is an easy way to do it.
- Parse and store the HTML on a per-line basis.
- Compute the text density of each line: the ratio of text bytes (what remains after stripping the markup) to the total bytes in the line.
- The higher the text density, the more important the text.
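
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the names text_density and rank_lines, the regex-based tag stripping, and the 0.5 threshold are all assumptions made for this example. A regex is a crude way to remove markup (it does not skip script or style contents, for instance), so a real HTML parser would likely be a better fit for production use.

```python
import re

# Naive tag matcher (assumption: good enough for a sketch, not for real-world HTML).
TAG_RE = re.compile(r"<[^>]*>")


def text_density(line: str) -> float:
    """Ratio of text bytes (markup stripped) to total bytes for one line."""
    total = len(line.encode("utf-8"))
    if total == 0:
        return 0.0
    text = TAG_RE.sub("", line).strip()
    return len(text.encode("utf-8")) / total


def rank_lines(html: str, threshold: float = 0.5):
    """Return (density, text) pairs for lines at or above the threshold.

    The threshold value is a hypothetical default chosen for illustration.
    """
    results = []
    for line in html.splitlines():
        line = line.strip()
        if not line:
            continue
        density = text_density(line)
        if density >= threshold:
            results.append((density, TAG_RE.sub("", line).strip()))
    return results


sample = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<p>Search engines build summaries from the densest lines of a page.</p>
</body></html>"""

for density, text in rank_lines(sample):
    print(f"{density:.2f}  {text}")
```

In this sample, the navigation line is mostly markup and scores low, while the paragraph line is mostly text and is the only one kept.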