Class HtmlParser

  • All Implemented Interfaces:
    org.xml.sax.XMLReader
    Direct Known Subclasses:
    InfosetCoercingHtmlParser

    public class HtmlParser
    extends java.lang.Object
    implements org.xml.sax.XMLReader
    This class implements an HTML5 parser that exposes data through the SAX2 interface.

    By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser fully support non-conforming HTML per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. To treat XML 1.0 infoset violations as fatal, set the general XML violation policy to FATAL.

    By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.

    By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
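
    The SAX2 exposure described in this overview can be sketched roughly as follows. The class ElementNameCollector and its collect method are hypothetical names introduced for illustration, and only org.xml.sax types from the JDK are used so that the sketch compiles without the parser jar. In real use the reader would presumably be an HtmlParser constructed with an explicit XmlViolationPolicy; the main method uses the JDK's own XML reader purely as a stand-in.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records the local name of every start tag reported over SAX2. */
final class ElementNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(localName);
    }

    List<String> names() {
        return names;
    }

    /** Parses the markup with the given reader and returns element names in document order. */
    static List<String> collect(XMLReader reader, String markup) throws IOException, SAXException {
        ElementNameCollector collector = new ElementNameCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.names();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK (an assumption made so the sketch runs on its own).
        // With the parser jar on the classpath this would instead be something like:
        //   XMLReader reader = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        System.out.println(collect(reader, "<html><body><p>Hi</p></body></html>"));
    }
}
```

    Because collect depends only on the XMLReader interface, the same handler works unchanged whichever violation policy the parser was constructed with.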

    Author:
    hsivonen
    • Constructor Detail

      • HtmlParser

        public HtmlParser()
        Instantiates the parser with a fatal XML violation policy.
      • HtmlParser

        public HtmlParser​(XmlViolationPolicy xmlPolicy)
        Instantiates the parser with a specific XML violation policy.
        Parameters:
        xmlPolicy - the policy
    • Method Detail

      • getContentHandler

        public org.xml.sax.ContentHandler getContentHandler()
        Specified by:
        getContentHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.getContentHandler()
      • getDTDHandler

        public org.xml.sax.DTDHandler getDTDHandler()
        Specified by:
        getDTDHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.getDTDHandler()
      • getEntityResolver

        public org.xml.sax.EntityResolver getEntityResolver()
        Specified by:
        getEntityResolver in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.getEntityResolver()
      • getErrorHandler

        public org.xml.sax.ErrorHandler getErrorHandler()
        Specified by:
        getErrorHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.getErrorHandler()
      • getFeature

        public boolean getFeature​(java.lang.String name)
                           throws org.xml.sax.SAXNotRecognizedException,
                                  org.xml.sax.SAXNotSupportedException
        Exposes the configuration of the emulated XML parser, as well as boolean-valued configuration, without using non-XMLReader getters directly.

        The features are mapped as follows:

        http://xml.org/sax/features/external-general-entities
        false
        http://xml.org/sax/features/external-parameter-entities
        false
        http://xml.org/sax/features/is-standalone
        true
        http://xml.org/sax/features/lexical-handler/parameter-entities
        false
        http://xml.org/sax/features/namespaces
        true
        http://xml.org/sax/features/namespace-prefixes
        false
        http://xml.org/sax/features/resolve-dtd-uris
        true
        http://xml.org/sax/features/string-interning
        false
        http://xml.org/sax/features/unicode-normalization-checking
        isCheckingNormalization
        http://xml.org/sax/features/use-attributes2
        false
        http://xml.org/sax/features/use-locator2
        false
        http://xml.org/sax/features/use-entity-resolver2
        false
        http://xml.org/sax/features/validation
        false
        http://xml.org/sax/features/xmlns-uris
        false
        http://xml.org/sax/features/xml-1.1
        false
        http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
        isHtml4ModeCompatibleWithXhtml1Schemata
        http://validator.nu/features/mapping-lang-to-xml-lang
        isMappingLangToXmlLang
        http://validator.nu/features/scripting-enabled
        isScriptingEnabled
        Specified by:
        getFeature in interface org.xml.sax.XMLReader
        Parameters:
        name - feature URI string
        Returns:
        a value per the list above
        Throws:
        org.xml.sax.SAXNotRecognizedException
        org.xml.sax.SAXNotSupportedException
        See Also:
        XMLReader.getFeature(java.lang.String)
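
        As a minimal sketch of reading the features listed above through the generic XMLReader interface, the hypothetical helper below (FeatureProbe and describe are names invented here) looks up each URI and records either its boolean value or which of the two declared exceptions was thrown. It assumes nothing about the parser beyond the getFeature signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: reports feature values without calling parser-specific getters. */
final class FeatureProbe {
    // Feature URIs copied from the list above.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String SCRIPTING_ENABLED = "http://validator.nu/features/scripting-enabled";

    /** Maps each feature URI to "true", "false", "not recognized", or "not supported". */
    static Map<String, String> describe(XMLReader reader, String... featureUris) {
        Map<String, String> report = new LinkedHashMap<>();
        for (String uri : featureUris) {
            try {
                report.put(uri, Boolean.toString(reader.getFeature(uri)));
            } catch (SAXNotRecognizedException e) {
                report.put(uri, "not recognized");
            } catch (SAXNotSupportedException e) {
                report.put(uri, "not supported");
            }
        }
        return report;
    }
}
```

        A call such as FeatureProbe.describe(parser, FeatureProbe.NAMESPACES, FeatureProbe.SCRIPTING_ENABLED) returns the entries in the order the URIs were passed.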
      • getProperty

        public java.lang.Object getProperty​(java.lang.String name)
                                     throws org.xml.sax.SAXNotRecognizedException,
                                            org.xml.sax.SAXNotSupportedException
        Allows XMLReader-level access to non-boolean valued getters.

        The properties are mapped as follows:

        http://xml.org/sax/properties/document-xml-version
        "1.0"
        http://xml.org/sax/properties/lexical-handler
        getLexicalHandler
        http://validator.nu/properties/content-space-policy
        getContentSpacePolicy
        http://validator.nu/properties/content-non-xml-char-policy
        getContentNonXmlCharPolicy
        http://validator.nu/properties/comment-policy
        getCommentPolicy
        http://validator.nu/properties/xmlns-policy
        getXmlnsPolicy
        http://validator.nu/properties/name-policy
        getNamePolicy
        http://validator.nu/properties/streamability-violation-policy
        getStreamabilityViolationPolicy
        http://validator.nu/properties/document-mode-handler
        getDocumentModeHandler
        http://validator.nu/properties/doctype-expectation
        getDoctypeExpectation
        http://xml.org/sax/features/unicode-normalization-checking
        Specified by:
        getProperty in interface org.xml.sax.XMLReader
        Parameters:
        name - property URI string
        Returns:
        a value per the list above
        Throws:
        org.xml.sax.SAXNotRecognizedException
        org.xml.sax.SAXNotSupportedException
        See Also:
        XMLReader.getProperty(java.lang.String)
      • parse

        public void parse​(org.xml.sax.InputSource input)
                   throws java.io.IOException,
                          org.xml.sax.SAXException
        Specified by:
        parse in interface org.xml.sax.XMLReader
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
        See Also:
        XMLReader.parse(org.xml.sax.InputSource)
      • parseFragment

        public void parseFragment​(org.xml.sax.InputSource input,
                                  java.lang.String context)
                           throws java.io.IOException,
                                  org.xml.sax.SAXException
        Parses a fragment.
        Parameters:
        input - the input to parse
        context - the name of the context element
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
      • parse

        public void parse​(java.lang.String systemId)
                   throws java.io.IOException,
                          org.xml.sax.SAXException
        Specified by:
        parse in interface org.xml.sax.XMLReader
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
        See Also:
        XMLReader.parse(java.lang.String)
      • setContentHandler

        public void setContentHandler​(org.xml.sax.ContentHandler handler)
        Specified by:
        setContentHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.setContentHandler(org.xml.sax.ContentHandler)
      • setLexicalHandler

        public void setLexicalHandler​(org.xml.sax.ext.LexicalHandler handler)
        Sets the lexical handler.
        Parameters:
        handler - the handler
      • setDTDHandler

        public void setDTDHandler​(org.xml.sax.DTDHandler handler)
        Specified by:
        setDTDHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.setDTDHandler(org.xml.sax.DTDHandler)
      • setEntityResolver

        public void setEntityResolver​(org.xml.sax.EntityResolver resolver)
        Specified by:
        setEntityResolver in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
      • setErrorHandler

        public void setErrorHandler​(org.xml.sax.ErrorHandler handler)
        Specified by:
        setErrorHandler in interface org.xml.sax.XMLReader
        See Also:
        XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
      • setTransitionHandler

        public void setTransitionHandler​(TransitionHandler handler)
      • setTreeBuilderErrorHandlerOverride

        public void setTreeBuilderErrorHandlerOverride​(org.xml.sax.ErrorHandler handler)
        Deprecated.
        For Validator.nu internal use
        See Also:
        XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
      • setFeature

        public void setFeature​(java.lang.String name,
                               boolean value)
                        throws org.xml.sax.SAXNotRecognizedException,
                               org.xml.sax.SAXNotSupportedException
        Sets a boolean feature without having to use non-XMLReader setters directly.

        The supported features are:

        http://xml.org/sax/features/unicode-normalization-checking
        setCheckingNormalization
        http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
        setHtml4ModeCompatibleWithXhtml1Schemata
        http://validator.nu/features/mapping-lang-to-xml-lang
        setMappingLangToXmlLang
        http://validator.nu/features/scripting-enabled
        setScriptingEnabled
        Specified by:
        setFeature in interface org.xml.sax.XMLReader
        Throws:
        org.xml.sax.SAXNotRecognizedException
        org.xml.sax.SAXNotSupportedException
        See Also:
        XMLReader.setFeature(java.lang.String, boolean)
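
        The feature URIs above can be applied in bulk through the generic interface. The sketch below is one possible approach, not part of this API: FeatureConfigurer and apply are hypothetical names, and the choice to collect rejected URIs instead of propagating the exception is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: applies boolean features by URI and reports the ones rejected. */
final class FeatureConfigurer {
    // Feature URIs copied from the list above.
    static final String SCRIPTING_ENABLED = "http://validator.nu/features/scripting-enabled";
    static final String MAPPING_LANG_TO_XML_LANG = "http://validator.nu/features/mapping-lang-to-xml-lang";

    /** Applies each setting in iteration order and returns the URIs the reader refused. */
    static List<String> apply(XMLReader reader, Map<String, Boolean> settings) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : settings.entrySet()) {
            try {
                reader.setFeature(entry.getKey(), entry.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
                // The reader does not know the feature or cannot set it to this value.
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

        Collecting rejections keeps the calling code independent of which concrete XMLReader is in use, at the price of having to check the returned list.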
      • setProperty

        public void setProperty​(java.lang.String name,
                                java.lang.Object value)
                         throws org.xml.sax.SAXNotRecognizedException,
                                org.xml.sax.SAXNotSupportedException
        Sets a non-boolean property without having to use non-XMLReader setters directly.

        The supported properties are:

        http://xml.org/sax/properties/lexical-handler
        setLexicalHandler
        http://validator.nu/properties/content-space-policy
        setContentSpacePolicy
        http://validator.nu/properties/content-non-xml-char-policy
        setContentNonXmlCharPolicy
        http://validator.nu/properties/comment-policy
        setCommentPolicy
        http://validator.nu/properties/xmlns-policy
        setXmlnsPolicy
        http://validator.nu/properties/name-policy
        setNamePolicy
        http://validator.nu/properties/streamability-violation-policy
        setStreamabilityViolationPolicy
        http://validator.nu/properties/document-mode-handler
        setDocumentModeHandler
        http://validator.nu/properties/doctype-expectation
        setDoctypeExpectation
        http://validator.nu/properties/xml-policy
        setXmlPolicy
        Specified by:
        setProperty in interface org.xml.sax.XMLReader
        Throws:
        org.xml.sax.SAXNotRecognizedException
        org.xml.sax.SAXNotSupportedException
        See Also:
        XMLReader.setProperty(java.lang.String, java.lang.Object)
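
        To illustrate the lexical-handler property from the list above, the sketch below registers a LexicalHandler through setProperty and collects comment text. CommentCollector and collect are hypothetical names, and DefaultHandler2 from the JDK is used only as a convenient no-op base class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper: gathers comments reported through the lexical-handler property. */
final class CommentCollector extends DefaultHandler2 {
    // Property URI copied from the list above.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    private final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Registers a collector on the reader, parses the markup, and returns the comments seen. */
    static List<String> collect(XMLReader reader, String markup) throws IOException, SAXException {
        CommentCollector collector = new CommentCollector();
        reader.setProperty(LEXICAL_HANDLER, collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.comments;
    }
}
```

        On this class, calling setLexicalHandler directly should be equivalent according to the mapping above; the property route is mainly useful when the code only holds an XMLReader reference.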
      • isCheckingNormalization

        public boolean isCheckingNormalization()
        Indicates whether NFC normalization of source is being checked.
        Returns:
        true if NFC normalization of source is being checked.
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()
      • setCheckingNormalization

        public void setCheckingNormalization​(boolean enable)
        Toggles the checking of the NFC normalization of source.
        Parameters:
        enable - true to check normalization
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)
      • isScriptingEnabled

        public boolean isScriptingEnabled()
        Whether the parser considers scripting to be enabled for noscript treatment.
        Returns:
        true if enabled
        See Also:
        TreeBuilder.isScriptingEnabled()
      • setScriptingEnabled

        public void setScriptingEnabled​(boolean scriptingEnabled)
        Sets whether the parser considers scripting to be enabled for noscript treatment.
        Parameters:
        scriptingEnabled - true to enable
        See Also:
        TreeBuilder.setScriptingEnabled(boolean)
      • getDoctypeExpectation

        public DoctypeExpectation getDoctypeExpectation()
        Returns the doctype expectation.
        Returns:
        the doctypeExpectation
      • getDocumentModeHandler

        public DocumentModeHandler getDocumentModeHandler()
        Returns the document mode handler.
        Returns:
        the documentModeHandler
      • getStreamabilityViolationPolicy

        public XmlViolationPolicy getStreamabilityViolationPolicy()
        Returns the streamability violation policy.
        Returns:
        the streamability violation policy
      • setStreamabilityViolationPolicy

        public void setStreamabilityViolationPolicy​(XmlViolationPolicy streamabilityViolationPolicy)
        Sets the streamability violation policy.
        Parameters:
        streamabilityViolationPolicy - the streamability violation policy to set
      • setHtml4ModeCompatibleWithXhtml1Schemata

        public void setHtml4ModeCompatibleWithXhtml1Schemata​(boolean html4ModeCompatibleWithXhtml1Schemata)
        Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
        Parameters:
        html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value
      • getDocumentLocator

        public org.xml.sax.Locator getDocumentLocator()
        Returns the Locator during parse.
        Returns:
        the Locator
      • isHtml4ModeCompatibleWithXhtml1Schemata

        public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
        Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
        Returns:
        the html4ModeCompatibleWithXhtml1Schemata
      • setMappingLangToXmlLang

        public void setMappingLangToXmlLang​(boolean mappingLangToXmlLang)
        Sets whether lang is mapped to xml:lang.
        Parameters:
        mappingLangToXmlLang - true to map lang to xml:lang
        See Also:
        Tokenizer.setMappingLangToXmlLang(boolean)
      • isMappingLangToXmlLang

        public boolean isMappingLangToXmlLang()
        Whether lang is mapped to xml:lang.
        Returns:
        the mappingLangToXmlLang
      • getXmlnsPolicy

        public XmlViolationPolicy getXmlnsPolicy()
        Returns the xmlnsPolicy.
        Returns:
        the xmlnsPolicy
      • getLexicalHandler

        public org.xml.sax.ext.LexicalHandler getLexicalHandler()
        Returns the lexicalHandler.
        Returns:
        the lexicalHandler
      • getCommentPolicy

        public XmlViolationPolicy getCommentPolicy()
        Returns the commentPolicy.
        Returns:
        the commentPolicy
      • getContentNonXmlCharPolicy

        public XmlViolationPolicy getContentNonXmlCharPolicy()
        Returns the contentNonXmlCharPolicy.
        Returns:
        the contentNonXmlCharPolicy
      • getContentSpacePolicy

        public XmlViolationPolicy getContentSpacePolicy()
        Returns the contentSpacePolicy.
        Returns:
        the contentSpacePolicy
      • isReportingDoctype

        public boolean isReportingDoctype()
        Returns the reportingDoctype.
        Returns:
        the reportingDoctype
      • setErrorProfile

        public void setErrorProfile​(java.util.HashMap<java.lang.String,​java.lang.String> errorProfileMap)
        Sets the error profile.
        Parameters:
        errorProfileMap - the error profile map
        See Also:
        nu.validator.htmlparser.impl.ErrorReportingTokenizer#setErrorProfile(set)
      • setHeuristics

        public void setHeuristics​(Heuristics heuristics)
        Sets the encoding sniffing heuristics.
        Parameters:
        heuristics - the heuristics to set
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)
      • getHeuristics

        public Heuristics getHeuristics()
      • setXmlPolicy

        public void setXmlPolicy​(XmlViolationPolicy xmlPolicy)
        Convenience method that sets the name, xmlns, content space, content non-XML character, and comment policies in one call. It does not affect the streamability violation policy or doctype reporting.
        Parameters:
        xmlPolicy - the policy to apply to each of the policies listed above
      • getNamePolicy

        public XmlViolationPolicy getNamePolicy()
        The policy for non-NCName element and attribute names.
        Returns:
        the namePolicy
      • setBogusXmlnsPolicy

        public void setBogusXmlnsPolicy​(XmlViolationPolicy bogusXmlnsPolicy)
        Deprecated.
        Does nothing.
      • getBogusXmlnsPolicy

        public XmlViolationPolicy getBogusXmlnsPolicy()
        Deprecated.
        Returns XmlViolationPolicy.ALTER_INFOSET.
        Returns:
        XmlViolationPolicy.ALTER_INFOSET
      • addCharacterHandler

        public void addCharacterHandler​(CharacterHandler characterHandler)