The debut of Substance: An HTML-to-Markdown extractor

Over the past week, I’ve been building a tool that extracts the main content of the current web page and converts it to Markdown for archiving. So far, I’ve finished a web app that serves as a proof of concept, or a preview ahead of the final release. Its only feature is extracting Wikipedia articles and downloading them as Markdown files. Here’s the link:

substance.reorx.com

Put simply, the goal of the product is to be a more extensible alternative to MarkDownload. MarkDownload has been an excellent help for archiving content from the web, but it does not work well on every website. Every now and then, I find it gives bad results on some sites, such as Wikipedia (that’s why I chose it as the first example to work on).

After releasing this web app, I’ll focus on developing the extension and writing documentation for the product. The code is open-sourced here, though it has no README so far. You can give me feedback in the issues or reply here if you like.

In the next post, I’ll give a detailed introduction to what Substance really is and how it works.

Until next time, don’t forget to subscribe to my freshly made newsletter to get the latest updates.