#pragma once
#include "utils.hpp"
#include <stdexcept>
#include <string>
#include <vector>
#include <cstring>
#include <tidy/tidy.h>
#include <tidy/tidybuffio.h>
#include <pugixml.hpp>
using namespace pugi;
/**
 * Represents an HTML document.
 *
 * Upon construction, the given HTML is converted to XML using _tidy_, then
 * fed to pugixml for parsing.
 *
 * This parser is aimed at feed readers: it exposes only a handful of values
 * (title, icon, image, RSS link, description, article, body) and is not
 * suitable as a full-fledged HTML parser.
 *
 * Values are parsed on demand, when requested, mostly to avoid the overhead
 * of parsing information that is never needed.
 *
 * If a value cannot be found, it is left as an empty string.
 */
class Html {
private:
    xml_document doc;
    xml_node head;
    std::string title{""};
    std::string icon_url{""};
    std::string img_url{""};
    std::string rss_url{""};
    std::string description{""};
    std::string article{""};
    std::string body{""};

    /**
     * Applies a default configuration set to a TidyDoc.
     */
    static void configure_tidy_doc(TidyDoc &doc);

    /**
     * Returns a TidyDoc given a valid file path.
     */
    TidyDoc tidy_doc_from_file(std::string path);

    /**
     * Converts a TidyDoc document to XML, and returns it as a string.
     */
    std::string convert_to_xml(TidyDoc doc);

    static inline const std::vector<std::string> USELESS_CHILDREN = {
        "script", "form", "input", "label", "nav", "footer", "header"
    };

    /**
     * Removes children that are deemed useless for the information this class
     * needs to parse.
     */
    void remove_useless_children(xml_node &root);

    /**
     * Constructs an Html object from a TidyDoc document.
     */
    Html(TidyDoc &tdoc);

    /**
     * Returns the `body` node from the current xml_document.
     */
    xml_node get_body_node();

public:
    /**
     * Constructs the Html object from a valid file path.
     *
     * @param path a valid file path to a local HTML document.
     */
    Html(std::string path);

    /**
     * Constructs the Html object from a string containing valid HTML.
     *
     * @param s a string containing the HTML to parse
     */
    static Html from_string(std::string s);
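
    /**
     * Accessors for the values described in the class documentation. Each
     * one returns an empty string if the value cannot be found.
     */
    std::string get_title();
    std::string get_icon_url();
    std::string get_img_url();
    std::string get_rss_url();
    std::string get_body();
    std::string get_article();
    std::string get_description();

    /**
     * Returns a JSON representation of the document.
     *
     * @param metadata_only when true, only metadata is included in the output
     */
    std::string to_json(bool metadata_only=false);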
};