c# - Regex Extract html Body - Stack Overflow

How about something like this?

It captures everything between the <body> and </body> tags into a group named theBody. RegexOptions.IgnoreCase makes the match case-insensitive, so <BODY> is found as well.

RegexOptions.Singleline makes . match newline characters too, so the captured group can span multiple lines of HTML.

If the HTML does not contain <body></body> tags, the Success property of the match will be false.
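To see what the two options and the Success check do in practice, a small helper along these lines may be useful. The class name, method name, and sample strings are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // The pattern from the answer, unchanged.
    private const string Pattern = "<body>(?<theBody>.*)</body>";

    // Hypothetical helper: reports whether the pattern matches under the given options.
    public static bool HasBody(string html, RegexOptions options)
    {
        return Regex.Match(html, Pattern, options).Success;
    }
}
```

With the input "<BODY>\n<p>Hi</p>\n</BODY>", HasBody returns false under RegexOptions.IgnoreCase alone, because . stops at the first newline, and true once RegexOptions.Singleline is added. For an input with no body tags at all, such as "<p>no body here</p>", it returns false under either set of options.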

        using System.Text.RegularExpressions;

        string html = string.Empty;
        // Populate the html string here
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );
        Match match = regx.Match( html );
        if ( match.Success ) {
            // Everything between the opening and closing body tags
            string theBody = match.Groups["theBody"].Value;
        }
A good simple solution, but beware of body tags with spaces or attributes: < body id='content'> would not match.
– Quango, Dec 3, 2013 at 16:17
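One way to address the caveat in that comment is to loosen the pattern so the opening tag may carry whitespace and attributes. The following is only a sketch: the class and method names are hypothetical, and the pattern is an assumption of mine rather than part of the original answer. It will still fail on an attribute value that contains a > character, which is one reason a real parser is usually the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Loosened pattern (an assumption, not from the answer):
    //   <\s*body\b   tolerates whitespace after '<' and requires "body" as a whole word
    //   [^>]*>       skips any attributes up to the end of the opening tag
    //   (?<theBody>.*)  greedy capture, as in the original pattern
    //   <\s*/\s*body\s*>  closing tag, with optional whitespace
    private static readonly Regex BodyRegex = new Regex(
        @"<\s*body\b[^>]*>(?<theBody>.*)<\s*/\s*body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between the body tags, or null when no body element is found.
    public static string ExtractBody(string html)
    {
        Match match = BodyRegex.Match(html);
        return match.Success ? match.Groups["theBody"].Value : null;
    }
}
```

Calling BodyExtractor.ExtractBody("<html>< body id='content'><p>Hi</p></BODY></html>") returns "<p>Hi</p>", and a string without body tags yields null.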
  This is an agile HTML parser that
  builds a read/write DOM and supports
  plain XPATH or XSLT (you actually
  don't HAVE to understand XPATH nor
  XSLT to use it, don't worry...). It is
  a .NET code library that allows you to
  parse "out of the web" HTML files. The
  parser is very tolerant with "real
  world" malformed HTML. The object
  model is very similar to what proposes
  System.Xml, but for HTML documents (or
  streams).
Then you can extract the body with an XPath expression such as //body.
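As a rough sketch of that approach, and assuming the library quoted above is the HTML Agility Pack (the HtmlAgilityPack NuGet package, which must be installed separately), the extraction might look like this. The wrapper class and method names are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be the library described above

public static class AgilityBodyExtractor
{
    // Hypothetical helper: returns the inner HTML of the body element,
    // or null when the document has no body element.
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body == null ? null : body.InnerHtml;
    }
}
```

Because the parser builds a DOM first, attributes on the body tag need no special handling here, unlike with the regex approach.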