diff options
author | Andreas Baumann <abaumann@yahoo.com> | 2014-06-14 20:15:59 +0200 |
---|---|---|
committer | Andreas Baumann <abaumann@yahoo.com> | 2014-06-14 20:15:59 +0200 |
commit | 913e4215f22e16ad90a30b7e68e8cd2165c6812d (patch) | |
tree | d7aef8f6e7b29895f1b0160cb647e5427181198e /textwolf | |
parent | 4f6d08ce39cc430ed7ba90d143bf7af3fc8ca6d5 (diff) | |
download | crawler-913e4215f22e16ad90a30b7e68e8cd2165c6812d.tar.gz crawler-913e4215f22e16ad90a30b7e68e8cd2165c6812d.tar.bz2 |
added textwolf and a test for it
Diffstat (limited to 'textwolf')
26 files changed, 6340 insertions, 0 deletions
diff --git a/textwolf/.gitignore b/textwolf/.gitignore new file mode 100644 index 0000000..9bfe015 --- /dev/null +++ b/textwolf/.gitignore @@ -0,0 +1,9 @@ +src/ +tests/readStdinIterator +tests/readStdinIterator.o +tests/test_TextReader +tests/test_TextReader.o +tests/test_XMLPathSelect +tests/test_XMLPathSelect.o +tests/test_XMLScanner +tests/test_XMLScanner.o diff --git a/textwolf/README b/textwolf/README new file mode 100644 index 0000000..66f1e0a --- /dev/null +++ b/textwolf/README @@ -0,0 +1,28 @@ +Documentation +* For using textwolf just include "include/textwolf.hpp". +* textwolf can be compiled with the highest optimization level, specially with deep inline expansion +* The textwolf home is at at http://textwolf.net +* A textwolf introduction can be found at http://textwolf.net/tutorial.html +* A doxygen interface documentation is at http://patrickfrey.github.com/textwolf/html/index.html + +Examples +* See the examples in tests: +** readStdinIterator.cpp :Echoing stdin for all character sets +** test_TextReader.cpp :Iterating on a set of generated characters and test if read/write works for all characters +** test_XMLPathSelect.cpp :Iterating on the XML Path selected elements +** test_XMLScanner.cpp :Iterating on the XML elements + +Projects using textwolf +* textwolf is used in the wolframe project (see http://wolframe.net or http://github.com/Wolframe/Wolframe) + +Bugreports +* textwolf bug reports are for the time beeing collected as CSV file in BUGS. + +Project Schedule +* 2014/06/12 + Version 0.2.0 + * Support of IsoLatin codepages besides UTF/UCS encodings + * Chunk by chunk feeding reimplementation (using longjmp instead of exceptions) + + + diff --git a/textwolf/include/textwolf.hpp b/textwolf/include/textwolf.hpp new file mode 100644 index 0000000..8a71bde --- /dev/null +++ b/textwolf/include/textwolf.hpp @@ -0,0 +1,57 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ + +#ifndef __TEXTWOLF_HPP__ +#define __TEXTWOLF_HPP__ +/// \file textwolf.hpp +/// \brief Main include file + +#include "textwolf/char.hpp" +#include "textwolf/exception.hpp" +#include "textwolf/staticbuffer.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/charset.hpp" +#include "textwolf/textscanner.hpp" +#include "textwolf/xmlscanner.hpp" +#include "textwolf/cstringiterator.hpp" +#include "textwolf/sourceiterator.hpp" +#include "textwolf/xmltagstack.hpp" +#include "textwolf/xmlprinter.hpp" +#include "textwolf/xmlhdrparser.hpp" +#include "textwolf/xmlpathselect.hpp" + +#endif + + diff --git a/textwolf/include/textwolf/char.hpp b/textwolf/include/textwolf/char.hpp new file mode 100644 index 0000000..419cc24 --- /dev/null +++ b/textwolf/include/textwolf/char.hpp @@ -0,0 +1,143 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/char.hpp +/// \brief Definition of unicode characters +#ifndef __TEXTWOLF_CHAR_HPP__ +#define __TEXTWOLF_CHAR_HPP__ +#include <cstddef> + +#ifdef BOOST_VERSION +#include <boost/cstdint.hpp> +namespace textwolf { + /// \typedef UChar + /// \brief Unicode character type + typedef boost::uint32_t UChar; + typedef boost::uint64_t EChar; +}//namespace +#else +#ifdef _MSC_VER +#pragma warning(disable:4290) +#include <BaseTsd.h> +namespace textwolf { + /// \typedef UChar + /// \brief Unicode character type + typedef DWORD32 UChar; + typedef DWORD64 EChar; +}//namespace +#else +#include <stdint.h> +namespace textwolf { + /// \typedef UChar + /// \brief Unicode character type + typedef uint32_t UChar; + typedef uint64_t EChar; +}//namespace +#endif +#endif + +namespace textwolf { +/// \class CharMap +/// \brief Character map for fast typing of a character byte +/// \tparam RESTYPE result type of the map +/// \tparam nullvalue_ default intitialization value of the map +/// \tparam RANGE domain of the input values of the map +template <typename RESTYPE, RESTYPE nullvalue_, int RANGE=256> +class CharMap +{ +public: + typedef RESTYPE valuetype; + enum Constant {nullvalue=nullvalue_}; + +private: + RESTYPE ar[ RANGE]; //< the map elements +public: + /// \brief Constructor + CharMap() {for (unsigned int ii=0; ii<RANGE; ii++) ar[ii]=(valuetype)nullvalue;} + /// \brief Define the values of the elements in the interval [from,to] + /// \param[in] from start of the input intervall (belongs also to the input) + /// \param[in] to end of the input intervall (belongs also to the input) + /// \param[in] value value assigned to all elements in [from,to] + CharMap& operator()( unsigned char from, unsigned char to, valuetype value) {for (unsigned int ii=from; ii<=to; ii++) ar[ii]=value; return *this;} + /// \brief Define the values of the single element at 'at' + /// \param[in] at the input element + /// \param[in] value value assigned to the element 'at' + CharMap& operator()( unsigned char at, valuetype value) {ar[at] = value; return *this;} + /// \brief Read the element assigned to 'ii' + /// \param[in] ii the input element queried + /// \return the element at 'ii' + valuetype operator []( unsigned char ii) const {return ar[ii];} +}; + +/// \enum ControlCharacter +/// \brief Enumeration of control characters needed as events for XML scanner statemachine +enum ControlCharacter +{ + Undef=0, //< not defined (beyond ascii) + EndOfText, //< end of data (EOF,EOD,.) + EndOfLine, //< end of line + Cntrl, //< control character + Space, //< space, tab, etc.. + Amp, //< ampersant ('&') + Lt, //< lesser than '<' + Equal, //< equal '=' + Gt, //< greater than '>' + Slash, //< slash '/' + Dash, //< en dash (minus) '-' + Exclam, //< exclamation mark '!' + Questm, //< question mark '?' + Sq, //< single quote + Dq, //< double quote + Osb, //< open square bracket '[' + Csb, //< close square bracket ']' + Any //< any ascii character with meaning +}; +enum {NofControlCharacter=18}; //< total number of control characters + +/// \class ControlCharacterM +/// \brief Map of the enumeration of control characters to their names for debug messages +struct ControlCharacterM +{ + /// \brief Get the name of a control character as string + /// \param [in] c the control character to map + static const char* name( ControlCharacter c) + { + static const char* name[ NofControlCharacter] = {"Undef", "EndOfText", "EndOfLine", "Cntrl", "Space", "Amp", "Lt", "Equal", "Gt", "Slash", "Dash", "Exclam", "Questm", "Sq", "Dq", "Osb", "Csb", "Any"}; + return name[ (unsigned int)(unsigned char)c]; + } +}; + +}//namespace +#endif + diff --git a/textwolf/include/textwolf/charset.hpp b/textwolf/include/textwolf/charset.hpp new file mode 100644 index 0000000..93ac5c3 --- /dev/null +++ b/textwolf/include/textwolf/charset.hpp @@ -0,0 +1,46 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset.hpp +/// \brief Character set encodings already implemented in textwolf +/// \note The interface that the classes defined in the files included must fulfill is defined in "charset_interface.hpp" + +#ifndef __TEXTWOLF_XML_CHARSET_HPP__ +#define __TEXTWOLF_XML_CHARSET_HPP__ +#include "textwolf/charset_utf8.hpp" +#include "textwolf/charset_utf16.hpp" +#include "textwolf/charset_ucs.hpp" +#include "textwolf/charset_isolatin.hpp" +#endif + diff --git a/textwolf/include/textwolf/charset_interface.hpp b/textwolf/include/textwolf/charset_interface.hpp new file mode 100644 index 0000000..b99bdf7 --- /dev/null +++ b/textwolf/include/textwolf/charset_interface.hpp @@ -0,0 +1,146 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset_interface.hpp +/// \brief Interface that describes what a character set encoding implementation has to define to be used as character set template parameter for textwolf. +/// \remark This interface is more a documentation because the template library relies on the properties of the character set classes rather than on the interface it implements. +#ifndef __TEXTWOLF_CHARSET_INTERFACE_HPP__ +#define __TEXTWOLF_CHARSET_INTERFACE_HPP__ +#include <cstddef> +#include "textwolf/staticbuffer.hpp" + +namespace textwolf { +/// \namespace textwolf::charset +/// \brief namespace of character set encoding definitions +namespace charset { + +/// \class Encoder +/// \brief Collection of functions for encode/decode XML character entities +struct Encoder +{ + /// \brief Write the character 'chr' in encoded form as nul-terminated string to a buffer + /// \param[in] chr unicode character to encode + /// \param[out] bufptr buffer to write to + /// \param[in] bufsize allocation size of buffer pointer by 'bufptr' + static bool encode( UChar chr, char* bufptr, std::size_t bufsize) + { + static const char* HEX = "0123456789abcdef"; + StaticBuffer buf( bufptr, bufsize); + char bb[ 32]; + unsigned int ii=0; + while (chr > 0) + { + bb[ii++] = HEX[ chr & 0xf]; + chr /= 16; + } + buf.push_back( '&'); + buf.push_back( '#'); + buf.push_back( 'x'); + while (ii) + { + buf.push_back( bb[ --ii]); + } + buf.push_back( ';'); + buf.push_back( '\0'); + return !buf.overflow(); + } +}; + +/// \class Interface +/// \brief This interface has to be implemented for a character set encoding +struct Interface +{ + /// \brief Maximum character this characer set encoding can represent + enum {MaxChar=0xFF}; + + /// \brief Skip to start of the next character + /// \param [in] buf buffer for the character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator to skip + template <class Iterator> + static void skip( char* buf, unsigned int& bufpos, Iterator& itr); + + /// \brief Fetches the ascii char representation of the current character + /// \param [in] buf buffer for the parses character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator on the source + /// \return the value of the ascii character or -1 + template <class Iterator> + static signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr); + + /// \brief Fetches the bytes of the current character into a buffer + /// \param [in] buf buffer for the parses character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator on the source + template <class Iterator> + static void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr); + + /// \brief Fetches the unicode character representation of the current character + /// \param [in] buf buffer for the parses character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator on the source + /// \return the value of the unicode character + template <class Iterator> + UChar value( char* buf, unsigned int& bufpos, Iterator& itr) const; + + /// \brief Prints a unicode character to a buffer + /// \tparam Buffer_ STL back insertion sequence + /// \param [in] chr character to print + /// \param [out] buf buffer to print to + template <class Buffer_> + void print( UChar chr, Buffer_& buf) const; + + /// \brief Evaluate if two character set encodings of the same type are equal in all properties (code page, etc.) + /// \return true if yes + static bool is_equal( const Interface&, const Interface&) + { + return true; + } +}; + +/// \class ByteOrder +/// \brief Order of bytes for wide char character sets +struct ByteOrder +{ + enum + { + LE=0, //< little endian + BE=1 //< big endian + }; +}; + +}//namespace +}//namespace +#endif + diff --git a/textwolf/include/textwolf/charset_isolatin.hpp b/textwolf/include/textwolf/charset_isolatin.hpp new file mode 100644 index 0000000..b6bd660 --- /dev/null +++ b/textwolf/include/textwolf/charset_isolatin.hpp @@ -0,0 +1,126 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset_isolatin.hpp +/// \brief Definition of IsoLatin encodings + +#ifndef __TEXTWOLF_CHARSET_ISOLATIN_HPP__ +#define __TEXTWOLF_CHARSET_ISOLATIN_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include "textwolf/codepages.hpp" +#include <cstddef> + +namespace textwolf { +namespace charset { + +/// \class IsoLatin +/// \brief Character set IsoLatin-1,..IsoLatin-9 (ISO-8859-1,...ISO-8859-9) +struct IsoLatin :public IsoLatinCodePage +{ + enum {MaxChar=0xFF}; + + IsoLatin( const IsoLatin& o) + :IsoLatinCodePage(o){} + IsoLatin( unsigned int codePageIdx=1) + :IsoLatinCodePage(codePageIdx){} + + /// \brief See template<class Iterator>Interface::skip(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void skip( char*, unsigned int& bufpos, Iterator& itr) + { + if (bufpos==0) + { + ++itr; + ++bufpos; + } + } + + /// \brief See template<class Iterator>Interface::fetchbytes(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr) + { + if (bufpos==0) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + } + + /// \brief See template<class Iterator>Interface::asciichar(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr) + { + fetchbytes( buf, bufpos, itr); + return ((unsigned char)(buf[0])>127)?-1:buf[0]; + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + inline UChar value( char* buf, unsigned int& bufpos, Iterator& itr) const + { + fetchbytes( buf, bufpos, itr); + return ucharcode( buf[0]); + } + + /// \brief See template<class Buffer>Interface::print(UChar,Buffer&) + template <class Buffer_> + void print( UChar chr, Buffer_& buf) const + { + char chr_ = invcode( chr); + if (chr_ == 0) + { + char tb[ 32]; + char* cc = tb; + Encoder::encode( chr, tb, sizeof(tb)); + while (*cc) buf.push_back( *cc++); + } + else + { + buf.push_back( chr_); + } + } + + /// \brief See template<class Buffer>Interface::is_equal( const Interface&, const Interface&) + static inline bool is_equal( const IsoLatin& a, const IsoLatin& b) + { + return IsoLatinCodePage::is_equal( a, b); + } +}; + +}//namespace +}//namespace +#endif diff --git a/textwolf/include/textwolf/charset_ucs.hpp b/textwolf/include/textwolf/charset_ucs.hpp new file mode 100644 index 0000000..22f2cab --- /dev/null +++ b/textwolf/include/textwolf/charset_ucs.hpp @@ -0,0 +1,238 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset_ucs.hpp +/// \brief Definition of UCS-2/UCS-4 encodings + +#ifndef __TEXTWOLF_CHARSET_UCS_HPP__ +#define __TEXTWOLF_CHARSET_UCS_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include <cstddef> + +namespace textwolf { +namespace charset { + +/// \class UCS2 +/// \brief Character set UCS-2 (little/big endian) +/// \tparam byteorder charset::ByteOrder::LE or charset::ByteOrder::BE +/// UCS-2 encoding is defined to be big-endian only. Although the similar designations 'UCS-2BE and UCS-2LE +/// imitate the UTF-16 labels, they do not represent official encoding schemes. (http://en.wikipedia.org/wiki/UTF-16/UCS-2) +/// therefore we take byteorder=ByteOrder::BE as default. +template <int byteorder=ByteOrder::BE> +struct UCS2 +{ + enum + { + LSB=(byteorder==ByteOrder::BE), //< least significant byte index (0 or 1) + MSB=(byteorder==ByteOrder::LE), //< most significant byte index (0 or 1) + Print1shift=(byteorder==ByteOrder::BE)?8:0, //< value to shift with to get the 1st character to print + Print2shift=(byteorder==ByteOrder::LE)?8:0, //< value to shift with to get the 2nd character to print + MaxChar=0xFFFF + }; + + /// \brief See template<class Iterator>Interface::skip(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void skip( char*, unsigned int& bufpos, Iterator& itr) + { + for (;bufpos < 2; ++bufpos) + { + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::fetchbytes(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr) + { + if (bufpos<2) + { + if (bufpos<1) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + buf[1] = *itr; + ++itr; + ++bufpos; + } + } + + template <class Iterator> + static inline UChar value_impl( char* buf, unsigned int& bufpos, Iterator& itr) + { + fetchbytes( buf, bufpos, itr); + UChar res = (unsigned char)buf[MSB]; + return (res << 8) + (unsigned char)buf[LSB]; + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + inline UChar value( char* buf, unsigned int& bufpos, Iterator& itr) const + { + return value_impl( buf, bufpos, itr); + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr) + { + UChar ch = value_impl( buf, bufpos, itr); + return (ch > 127)?-1:(char)ch; + } + + /// \brief See template<class Buffer>Interface::print(UChar,Buffer&) + template <class Buffer_> + inline void print( UChar chr, Buffer_& buf) const + { + if (chr>MaxChar) + { + char tb[ 32]; + char* cc = tb; + Encoder::encode( chr, tb, sizeof(tb)); + while (*cc) + { + buf.push_back( (UChar)*cc >> Print1shift); + buf.push_back( (UChar)*cc >> Print2shift); + ++cc; + } + } + else + { + buf.push_back( chr >> Print1shift); + buf.push_back( chr >> Print2shift); + } + } + + /// \brief See template<class Buffer>Interface::is_equal( const Interface&, const Interface&) + static inline bool is_equal( const UCS2&, const UCS2&) + { + return true; + } +}; + +/// \class UCS4 +/// \brief Character set UCS-4 (little/big endian) +/// \tparam byteorder ByteOrder::LE or ByteOrder::BE +template <int byteorder> +struct UCS4 +{ + enum + { + B0=(byteorder==ByteOrder::BE)?3:0, + B1=(byteorder==ByteOrder::BE)?2:1, + B2=(byteorder==ByteOrder::BE)?1:2, + B3=(byteorder==ByteOrder::BE)?0:3, + Print1shift=(byteorder==ByteOrder::BE)?24:0, //< value to shift with to get the 1st character to print + Print2shift=(byteorder==ByteOrder::BE)?16:8, //< value to shift with to get the 2nd character to print + Print3shift=(byteorder==ByteOrder::BE)?8:16, //< value to shift with to get the 3rd character to print + Print4shift=(byteorder==ByteOrder::BE)?0:24, //< value to shift with to get the 4th character to print + MaxChar=0xFFFFFFFF + }; + + /// \brief See template<class Iterator>Interface::fetchbytes(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr) + { + for (;bufpos < 4; ++bufpos) + { + buf[ bufpos] = *itr; + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline UChar value( char* buf, unsigned int& bufpos, Iterator& itr) + { + fetchbytes( buf, bufpos, itr); + UChar res = (unsigned char)buf[B3]; + res = (res << 8) + (unsigned char)buf[B2]; + res = (res << 8) + (unsigned char)buf[B1]; + return (res << 8) + (unsigned char)buf[B0]; + } + + /// \brief See template<class Iterator>Interface::skip(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void skip( char*, unsigned int& bufpos, Iterator& itr) + { + for (;bufpos < 4; ++bufpos) + { + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::asciichar(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr) + { + UChar ch = value( buf, bufpos, itr); + return (ch > 127)?-1:(char)ch; + } + + /// \brief See template<class Buffer>Interface::print(UChar,Buffer&) + template <class Buffer_> + static void print( UChar chr, Buffer_& buf) + { + buf.push_back( (unsigned char)((chr >> Print1shift) & 0xFF)); + buf.push_back( (unsigned char)((chr >> Print2shift) & 0xFF)); + buf.push_back( (unsigned char)((chr >> Print3shift) & 0xFF)); + buf.push_back( (unsigned char)((chr >> Print4shift) & 0xFF)); + } + + /// \brief See template<class Buffer>Interface::is_equal( const Interface&, const Interface&) + static inline bool is_equal( const UCS4&, const UCS4&) + { + return true; + } +}; + +/// \class UCS2LE +/// \brief UCS-2 little endian character set encoding +struct UCS2LE :public UCS2<ByteOrder::LE> {}; +/// \class UCS2BE +/// \brief UCS-2 big endian character set encoding +struct UCS2BE :public UCS2<ByteOrder::BE> {}; +/// \class UCS4LE +/// \brief UCS-4 little endian character set encoding +struct UCS4LE :public UCS4<ByteOrder::LE> {}; +/// \class UCS4BE +/// \brief UCS-4 big endian character set encoding +struct UCS4BE :public UCS4<ByteOrder::BE> {}; + +}//namespace +}//namespace +#endif diff --git a/textwolf/include/textwolf/charset_utf16.hpp b/textwolf/include/textwolf/charset_utf16.hpp new file mode 100644 index 0000000..576c202 --- /dev/null +++ b/textwolf/include/textwolf/charset_utf16.hpp @@ -0,0 +1,224 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset_utf16.hpp +/// \brief Definition of UTF-16 encodings + +#ifndef __TEXTWOLF_CHARSET_UTF16_HPP__ +#define __TEXTWOLF_CHARSET_UTF16_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include <cstddef> + +namespace textwolf { +namespace charset { + +/// \class UTF16 +/// \brief Character set UTF16 (little/big endian) +/// \tparam encoding ByteOrder::LE or ByteOrder::BE +/// \remark BOM character sequences are not interpreted as such and byte swapping is not done implicitely +/// It is left to the caller to detect BOM or its inverse and to switch the iterator. +/// \remark See http://en.wikipedia.org/wiki/UTF-16/UCS-2: ... If the endian architecture of the decoder +/// matches that of the encoder, the decoder detects the 0xFEFF value, but an opposite-endian decoder +/// interprets the BOM as the non-character value U+FFFE reserved for this purpose. This incorrect +/// result provides a hint to perform byte-swapping for the remaining values. If the BOM is missing, +/// the standard says that big-endian encoding should be assumed.... +template <int encoding=ByteOrder::BE> +class UTF16 +{ +private: + enum + { + LSB=(encoding==ByteOrder::BE), //< least significant byte index (0 or 1) + MSB=(encoding==ByteOrder::LE), //< most significant byte index (0 or 1) + Print1shift=(encoding==ByteOrder::BE)?8:0, //< value to shift with to get the 1st character to print + Print2shift=(encoding==ByteOrder::LE)?8:0 //< value to shift with to get the 2nd character to print + }; + +public: + enum + { + MaxChar=0x10FFFF //< maximum character in alphabet + }; + +public: + /// \brief See template<class Iterator>Interface::fetchbytes(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr) + { + if (bufpos<2) + { + if (bufpos<1) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + buf[1] = *itr; + ++itr; + ++bufpos; + } + } + + /// \brief Get the size of the current character in bytes (variable length encoding) + /// \param [in] buf buffer for the character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator + template <class Iterator> + static inline unsigned int size( char* buf, unsigned int& bufpos, Iterator& itr) + { + fetchbytes( buf, bufpos, itr); + + UChar rt = (unsigned char)buf[ MSB]; + if ((rt - 0xD8) > 0x03) + { + return 2; + } + else + { + return 4; + } + } + + /// \brief See template<class Iterator>Interface::skip(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void skip( char* buf, unsigned int& bufpos, Iterator& itr) + { + unsigned int bufsize = size( buf, bufpos, itr); + for (;bufpos < bufsize; ++bufpos) + { + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::asciichar(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr) + { + UChar ch = value_impl( buf, bufpos, itr); + return (ch > 127)?-1:(char)ch; + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + static UChar value_impl( char* buf, unsigned int& bufpos, Iterator& itr) + { + unsigned int bufsize = size( buf, bufpos, itr); + + UChar rt = (unsigned char)buf[ MSB]; + rt = (rt << 8) + (unsigned char)buf[ LSB]; + + if (bufsize == 4) + { + // 2 teilig + while (bufpos < bufsize) + { + buf[bufpos] = *itr; + ++itr; + ++bufpos; + } + rt -= 0xD800; + rt *= 0x400; + unsigned short lo = (unsigned char)buf[ 2+MSB]; + if ((lo - 0xDC) > 0x03) return 0xFFFF; + lo = (lo << 8) + (unsigned char)buf[ 2+LSB]; + return rt + lo - 0xDC00 + 0x010000; + } + return rt; + } + + template <class Iterator> + inline UChar value( char* buf, unsigned int& bufpos, Iterator& itr) const + { + return value_impl( buf, bufpos, itr); + } + + /// \brief See template<class Buffer>Interface::print(UChar,Buffer&) + template <class Buffer_> + void print( UChar ch, Buffer_& buf) const + { + if (ch <= 0xFFFF) + { + if ((ch - 0xD800) < 0x400) + { + //... reserved for encoding of characters in range [0xFFFF..0x10FFFF] + } + else + { + buf.push_back( (char)(unsigned char)((ch >> Print1shift) & 0xFF)); + buf.push_back( (char)(unsigned char)((ch >> Print2shift) & 0xFF)); + return; + } + } + else if (ch <= 0x10FFFF) + { + ch -= 0x10000; + unsigned short hi = (ch / 0x400) + 0xD800; + unsigned short lo = (ch % 0x400) + 0xDC00; + buf.push_back( (char)(unsigned char)((hi >> Print1shift) & 0xFF)); + buf.push_back( (char)(unsigned char)((hi >> Print2shift) & 0xFF)); + buf.push_back( (char)(unsigned char)((lo >> Print1shift) & 0xFF)); + buf.push_back( (char)(unsigned char)((lo >> Print2shift) & 0xFF)); + return; + } + char tb[ 32]; + char* cc = tb; + Encoder::encode( ch, tb, sizeof(tb)); + while (*cc) + { + buf.push_back( (char)(unsigned char)(((UChar)*cc >> Print1shift) & 0xFF)); + buf.push_back( (char)(unsigned char)(((UChar)*cc >> Print2shift) & 0xFF)); + ++cc; + } + } + + /// \brief See template<class Buffer>Interface::is_equal( const Interface&, const Interface&) + static inline bool is_equal( const UTF16&, const UTF16&) + { + return true; + } +}; + +/// \class UTF16LE +/// \brief UTF-16 little endian character set encoding +struct UTF16LE :public UTF16<ByteOrder::LE> {}; +/// \class UTF16BE +/// \brief UTF-16 big endian character set encoding +struct UTF16BE :public UTF16<ByteOrder::BE> {}; + +}//namespace +}//namespace +#endif + diff --git a/textwolf/include/textwolf/charset_utf8.hpp b/textwolf/include/textwolf/charset_utf8.hpp new file mode 100644 index 0000000..f31277a --- /dev/null +++ b/textwolf/include/textwolf/charset_utf8.hpp @@ -0,0 +1,218 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/charset_utf8.hpp +/// \brief Definition of UTF-8 encoding + +#ifndef __TEXTWOLF_CHARSET_UTF8_HPP__ +#define __TEXTWOLF_CHARSET_UTF8_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include <cstddef> + +namespace textwolf { +namespace charset { + +/// \class UTF8 +/// \brief character set encoding UTF-8 +struct UTF8 +{ + /// \brief Maximum character that can be represented by this encoding implementation + enum {MaxChar=0x7FFFFFFF}; + enum { + B11111111=0xFF, + B01111111=0x7F, + B00111111=0x3F, + B00011111=0x1F, + B00001111=0x0F, + B00000111=0x07, + B00000011=0x03, + B00000001=0x01, + B00000000=0x00, + B10000000=0x80, + B11000000=0xC0, + B11100000=0xE0, + B11110000=0xF0, + B11111000=0xF8, + B11111100=0xFC, + B11111110=0xFE, + + B11011111=B11000000|B00011111, + B11101111=B11100000|B00001111, + B11110111=B11110000|B00000111, + B11111011=B11111000|B00000011, + B11111101=B11111100|B00000001 + }; + + /// \class CharLengthTab + /// \brief Table that maps the first UTF-8 character byte to the length of the character in bytes + struct CharLengthTab :public CharMap<unsigned char, 0> + { + CharLengthTab() + { + (*this) + (B00000000,B01111111,1) + (B11000000,B11011111,2) + (B11100000,B11101111,3) + (B11110000,B11110111,4) + (B11111000,B11111011,5) + (B11111100,B11111101,6) + (B11111110,B11111110,7) + (B11111111,B11111111,8); + }; + }; + + /// \brief Get the size of the current character in bytes (variable length encoding) + /// \param [in] buf buffer for the character data + /// \param [in,out] bufpos position in 'buf' + /// \param [in,out] itr iterator to skip + template <class Iterator> + static inline unsigned int size( char* buf, unsigned int& bufpos, Iterator& itr) + { + static CharLengthTab charLengthTab; + if (bufpos==0) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + return charLengthTab[ (unsigned char)buf[ 0]]; + } + + /// \brief See template<class Iterator>Interface::skip(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void skip( char* buf, unsigned int& bufpos, Iterator& itr) + { + unsigned int bufsize = size( buf, bufpos, itr); + for (;bufpos < bufsize; ++bufpos) + { + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::asciichar(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline signed char asciichar( char* buf, unsigned int& bufpos, Iterator& itr) + { + if (bufpos==0) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + return ((unsigned char)(buf[0])>127)?-1:buf[0]; + } + + /// \brief See template<class Iterator>Interface::fetch(char*,unsigned int&,Iterator&) + template <class Iterator> + static inline void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr) + { + if (bufpos==0) + { + buf[0] = *itr; + ++itr; + ++bufpos; + } + unsigned int bufsize = size( buf, bufpos, itr); + for (;bufpos < bufsize; ++bufpos) + { + buf[ bufpos] = *itr; + ++itr; + } + } + + /// \brief See template<class Iterator>Interface::value(char*,unsigned int&,Iterator&) + template <class Iterator> + UChar value( char* buf, unsigned int& bufpos, Iterator& itr) const + { + fetchbytes( buf, bufpos, itr); + + UChar res = (unsigned char)buf[0]; + if (res > 127) + { + int gg = bufpos-2; + if (gg < 0) return MaxChar; + + res = ((unsigned char)buf[0])&(B00011111>>gg); + for (int ii=0; ii<=gg; ii++) + { + unsigned char xx = (unsigned char)buf[ii+1]; + res = (res<<6) | (xx & B00111111); + if ((unsigned char)(xx & B11000000) != B10000000) + { + return MaxChar; + } + } + } + return res; + } + + /// \brief See template<class Buffer>Interface::print(UChar,Buffer&) + template <class Buffer_> + void print( UChar chr, Buffer_& buf) const + { + unsigned int rt; + if (chr <= 127) + { + buf.push_back( (char)(unsigned char)chr); + return; + } + unsigned int pp,sf; + for (pp=1,sf=5; pp<5; pp++,sf+=5) + { + if (chr < (unsigned int)((1<<6)<<sf)) break; + } + rt = pp+1; + unsigned char HB = (unsigned char)(B11111111 << (8-rt)); + unsigned char shf = (unsigned char)(pp*6); + unsigned int ii; + buf.push_back( (char)(((unsigned char)(chr >> shf) & (~HB >> 1)) | HB)); + for (ii=1,shf-=6; ii<=pp; shf-=6,ii++) + { + buf.push_back( (char)(unsigned char) (((chr >> shf) & B00111111) | B10000000)); + } + } + + /// \brief See template<class Buffer>Interface::is_equal( const Interface&, const Interface&) + static bool is_equal( const UTF8&, const UTF8&) + { + return true; + } +}; + +}//namespace +}//namespace +#endif + diff --git a/textwolf/include/textwolf/codepages.hpp b/textwolf/include/textwolf/codepages.hpp new file mode 100644 index 0000000..4e8e7cf --- /dev/null +++ b/textwolf/include/textwolf/codepages.hpp @@ -0,0 +1,182 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/codepages.hpp +/// \brief Definition of IsoLatin code pages + +#ifndef __TEXTWOLF_CODE_PAGES_HPP__ +#define __TEXTWOLF_CODE_PAGES_HPP__ +#include "textwolf/char.hpp" +#include <map> + +namespace textwolf { +namespace charset { + +/// \class IsoLatinCodePage +/// \brief IsoLatin code page +class IsoLatinCodePage +{ +private: + struct InvOvlCodeMap + { + InvOvlCodeMap() + { + struct Element + { + unsigned short first; + unsigned char second; + }; + struct ElementAr + { + Element ar[ 64]; + }; + static const ElementAr ovlar[9] = + { + {{{0,0}}}, + {{{260,161}, {728,162}, {321,163}, {317,165}, {346,166}, {352,169}, {350,170}, {356,171}, {377,172}, {381,174}, {379,175}, {261,177}, {731,178}, {322,179}, {318,181}, {347,182}, {711,183}, {353,185}, {351,186}, {357,187}, {378,188}, {733,189}, {382,190}, {380,191}, {340,192}, {258,195}, {313,197}, {262,198}, {268,200}, {280,202}, {282,204}, {270,207}, {272,208}, {323,209}, {327,210}, {336,213}, {344,216}, {366,217}, {368,219}, {354,222}, {341,224}, {259,227}, {314,229}, {263,230}, {269,232}, {281,234}, {283,236}, {271,239}, {273,240}, {324,241}, {328,242}, {337,245}, {345,248}, {367,249}, {369,251}, {355,254}, {729,255}, {0,0}}}, + {{{294,161}, {728,162}, {292,165}, {304,168}, {350,169}, {286,170}, {308,171}, {379,173}, {295,175}, {293,180}, {305,183}, {351,184}, {287,185}, {309,186}, {380,188}, {266,193}, {264,194}, {288,208}, {284,211}, {364,216}, {348,217}, {267,223}, {265,224}, {289,238}, {285,241}, {365,246}, {349,247}, {729,248}, {0,0}}}, + {{{260,161}, {312,162}, {342,163}, {296,165}, {315,166}, {352,169}, {274,170}, {290,171}, {358,172}, {381,174}, {261,177}, {731,178}, {343,179}, {297,181}, {316,182}, {711,183}, {353,185}, {275,186}, {291,187}, {359,188}, {330,189}, {382,190}, {331,191}, {256,192}, {302,199}, {268,200}, {280,202}, {278,204}, {298,207}, {272,208}, {325,209}, {332,210}, {310,211}, {370,217}, {360,221}, {362,222}, {257,224}, {303,231}, {269,232}, {281,234}, {279,236}, {299,239}, {273,240}, {326,241}, {333,242}, {311,243}, {371,249}, {361,253}, {363,254}, {729,255}, {0,0}}}, + {{{286,208}, {304,221}, {350,222}, {287,240}, {305,253}, {351,254}, {0,0}}}, + {{{260,161}, {274,162}, {290,163}, {298,164}, {296,165}, {310,166}, {315,168}, {272,169}, {352,170}, {358,171}, {381,172}, {362,174}, {330,175}, {261,177}, {275,178}, {291,179}, {299,180}, {297,181}, {311,182}, {316,184}, {273,185}, {353,186}, {359,187}, {382,188}, {8213,189}, {363,190}, {331,191}, {256,192}, {302,199}, {268,200}, {280,202}, {278,204}, {325,209}, {332,210}, {360,215}, {370,217}, {257,224}, {303,231}, {269,232}, {281,234}, {279,236}, {326,241}, {333,242}, {361,247}, {371,249}, {312,255}, {0,0}}}, + {{{8221,161}, {8222,165}, {342,170}, {8220,180}, {343,186}, {260,192}, {302,193}, {256,194}, {262,195}, {280,198}, {274,199}, {268,200}, {377,202}, {278,203}, {290,204}, {310,205}, {298,206}, {315,207}, {352,208}, {323,209}, {325,210}, {332,212}, {370,216}, {321,217}, {346,218}, {362,219}, {379,221}, {381,222}, {261,224}, {303,225}, {257,226}, {263,227}, {281,230}, {275,231}, {269,232}, {378,234}, {279,235}, {291,236}, {311,237}, {299,238}, {316,239}, {353,240}, {324,241}, {326,242}, {333,244}, {371,248}, {322,249}, {347,250}, {363,251}, {380,253}, {382,254}, {8217,255}, {0,0}}}, + {{{7682,161}, {7683,162}, {266,164}, {267,165}, {7690,166}, {7808,168}, {7810,170}, {7691,171}, {7922,172}, {376,175}, {7710,176}, {7711,177}, {288,178}, {289,179}, {7744,180}, {7745,181}, {7766,183}, {7809,184}, {7767,185}, {7811,186}, {7776,187}, {7923,188}, {7812,189}, {7813,190}, {7777,191}, {372,208}, {7786,215}, {374,222}, {373,240}, {7787,247}, {375,254}, {0,0}}}, + {{{8364,164}, {352,166}, {353,168}, {381,180}, {382,184}, {338,188}, {339,189}, {376,190}, {0,0}}} + }; + unsigned int idx = 0; + for (; idx < 9; ++idx) + { + unsigned int ii = 0; + for (; ovlar[idx].ar[ii].first; ++ii) + { + m_map[idx][ ovlar[idx].ar[ii].first] = ovlar[idx].ar[ii].second; + } + } + } + + inline const std::map<unsigned short, unsigned char>* get( unsigned int idx) const + { + return &m_map[ idx]; + } + private: + std::map<unsigned short, unsigned char> m_map[9]; + }; + +public: + /// \brief Copy constructor + IsoLatinCodePage( const IsoLatinCodePage& o) + :m_cd(o.m_cd) + ,m_invcd(o.m_invcd) + ,m_invovlcd(o.m_invovlcd){} + + /// \brief Constructor + /// \param[in] idx IsoLatin code page index, 1 for "IsoLatin-1" + IsoLatinCodePage( unsigned int idx) + { + enum {NofCodePages=9}; + struct CodePage + { + unsigned short ar[128]; + }; + static const CodePage codePage[ NofCodePages] = { + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 260, 728, 321, 164, 317, 346, 167, 168, 352, 350, 356, 377, 173, 381, 379, 176, 261, 731, 322, 180, 318, 347, 711, 184, 353, 351, 357, 378, 733, 382, 380, 340, 193, 194, 258, 196, 313, 262, 199, 268, 201, 280, 203, 282, 205, 206, 270, 272, 323, 327, 211, 212, 336, 214, 215, 344, 366, 218, 368, 220, 221, 354, 223, 341, 225, 226, 259, 228, 314, 263, 231, 269, 233, 281, 235, 283, 237, 238, 271, 273, 324, 328, 243, 244, 337, 246, 247, 345, 367, 250, 369, 252, 253, 355, 729}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 294, 728, 163, 164, 292, 167, 168, 304, 350, 286, 308, 173, 379, 176, 295, 178, 179, 180, 181, 293, 183, 184, 305, 351, 287, 309, 189, 380, 192, 193, 194, 196, 266, 264, 199, 200, 201, 202, 203, 204, 205, 206, 207, 209, 210, 211, 212, 288, 214, 215, 284, 217, 218, 219, 220, 364, 348, 223, 224, 225, 226, 228, 267, 265, 231, 232, 233, 234, 235, 236, 237, 238, 239, 241, 242, 243, 244, 289, 246, 247, 285, 249, 250, 251, 252, 365, 349, 729}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 260, 312, 342, 164, 296, 315, 167, 168, 352, 274, 290, 358, 173, 381, 175, 176, 261, 731, 343, 180, 297, 316, 711, 184, 353, 275, 291, 359, 330, 382, 331, 256, 193, 194, 195, 196, 197, 198, 302, 268, 201, 280, 203, 278, 205, 206, 298, 272, 325, 332, 310, 212, 213, 214, 215, 216, 370, 218, 219, 220, 360, 362, 223, 257, 225, 226, 227, 228, 229, 230, 303, 269, 233, 281, 235, 279, 237, 238, 299, 273, 326, 333, 311, 244, 245, 246, 247, 248, 371, 250, 251, 252, 361, 363, 729}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 286, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 304, 350, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 287, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 305, 351, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 260, 274, 290, 298, 296, 310, 167, 315, 272, 352, 358, 381, 173, 362, 330, 176, 261, 275, 291, 299, 297, 311, 183, 316, 273, 353, 359, 382, 8213, 363, 331, 256, 193, 194, 195, 196, 197, 198, 302, 268, 201, 280, 203, 278, 205, 206, 207, 208, 325, 332, 211, 212, 213, 214, 360, 216, 370, 218, 219, 220, 221, 222, 223, 257, 225, 226, 227, 228, 229, 230, 303, 269, 233, 281, 235, 279, 237, 238, 239, 240, 326, 333, 243, 244, 245, 246, 361, 248, 371, 250, 251, 252, 253, 254, 312}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 8221, 162, 163, 164, 8222, 166, 167, 216, 169, 342, 171, 172, 173, 174, 198, 176, 177, 178, 179, 8220, 181, 182, 183, 248, 185, 343, 187, 188, 189, 190, 230, 260, 302, 256, 262, 196, 197, 280, 274, 268, 201, 377, 278, 290, 310, 298, 315, 352, 323, 325, 211, 332, 213, 214, 215, 370, 321, 346, 362, 220, 379, 381, 223, 261, 303, 257, 263, 228, 229, 281, 275, 269, 233, 378, 279, 291, 311, 299, 316, 353, 324, 326, 243, 333, 245, 246, 247, 371, 322, 347, 363, 252, 380, 382, 8217}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 7682, 7683, 163, 266, 267, 7690, 167, 7808, 169, 7810, 7691, 7922, 173, 174, 376, 7710, 7711, 288, 289, 7744, 7745, 182, 7766, 7809, 7767, 7811, 7776, 7923, 7812, 7813, 7777, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 372, 209, 210, 211, 212, 213, 214, 7786, 216, 217, 218, 219, 220, 221, 374, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 373, 241, 242, 243, 244, 245, 246, 7787, 248, 249, 250, 251, 252, 253, 375, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 8364, 165, 352, 167, 353, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 381, 181, 182, 183, 382, 185, 186, 187, 338, 339, 376, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255}} + }; + static const CodePage invcodePage[ NofCodePages] = { + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 0, 0, 164, 0, 0, 167, 168, 0, 0, 0, 0, 173, 0, 0, 176, 0, 0, 0, 180, 0, 0, 0, 184, 0, 0, 0, 0, 0, 0, 0, 0, 193, 194, 0, 196, 0, 0, 199, 0, 201, 0, 203, 0, 205, 206, 0, 0, 0, 0, 211, 212, 0, 214, 215, 0, 0, 218, 0, 220, 221, 0, 223, 0, 225, 226, 0, 228, 0, 0, 231, 0, 233, 0, 235, 0, 237, 238, 0, 0, 0, 0, 243, 244, 0, 246, 247, 0, 0, 250, 0, 252, 253, 0, 0}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 0, 163, 164, 0, 0, 166, 167, 0, 0, 0, 0, 172, 0, 0, 174, 0, 176, 177, 178, 179, 0, 181, 182, 0, 0, 0, 0, 187, 0, 0, 189, 190, 191, 0, 192, 0, 0, 195, 196, 197, 198, 199, 200, 201, 202, 203, 0, 204, 205, 206, 207, 0, 209, 210, 0, 212, 213, 214, 215, 0, 0, 218, 219, 220, 221, 0, 222, 0, 0, 225, 226, 227, 228, 229, 230, 231, 232, 233, 0, 234, 235, 236, 237, 0, 239, 240, 0, 242, 243, 244, 245, 0, 0, 0}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 0, 0, 164, 0, 0, 167, 168, 0, 0, 0, 0, 173, 0, 175, 176, 0, 0, 0, 180, 0, 0, 0, 184, 0, 0, 0, 0, 0, 0, 0, 0, 193, 194, 195, 196, 197, 198, 0, 0, 201, 0, 203, 0, 205, 206, 0, 0, 0, 0, 0, 212, 213, 214, 215, 216, 0, 218, 219, 220, 0, 0, 223, 0, 225, 226, 227, 228, 229, 230, 0, 0, 233, 0, 235, 0, 237, 238, 0, 0, 0, 0, 0, 244, 245, 246, 247, 248, 0, 250, 251, 252, 0, 0, 0}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 0, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 0, 0, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 0, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 0, 0, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 0, 0, 0, 0, 0, 167, 0, 0, 0, 0, 0, 173, 0, 0, 176, 0, 0, 0, 0, 0, 0, 183, 0, 0, 0, 0, 0, 0, 0, 0, 0, 193, 194, 195, 196, 197, 198, 0, 0, 201, 0, 203, 0, 205, 206, 207, 208, 0, 0, 211, 212, 213, 214, 0, 216, 0, 218, 219, 220, 221, 222, 223, 0, 225, 226, 227, 228, 229, 230, 0, 0, 233, 0, 235, 0, 237, 238, 239, 240, 0, 0, 243, 244, 245, 246, 0, 248, 0, 250, 251, 252, 253, 254, 0}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 162, 163, 164, 0, 166, 167, 0, 169, 0, 171, 172, 173, 174, 0, 176, 177, 178, 179, 0, 181, 182, 183, 0, 185, 0, 187, 188, 189, 190, 0, 0, 0, 0, 0, 196, 197, 175, 0, 0, 201, 0, 0, 0, 0, 0, 0, 0, 0, 0, 211, 0, 213, 214, 215, 168, 0, 0, 0, 220, 0, 0, 223, 0, 0, 0, 0, 228, 229, 191, 0, 0, 233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 243, 0, 245, 246, 247, 184, 0, 0, 0, 252, 0, 0, 0}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 0, 0, 163, 0, 0, 0, 167, 0, 169, 0, 0, 0, 173, 174, 0, 0, 0, 0, 0, 0, 0, 182, 0, 0, 0, 0, 0, 0, 0, 0, 0, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 0, 209, 210, 211, 212, 213, 214, 0, 216, 217, 218, 219, 220, 221, 0, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 0, 241, 242, 243, 244, 245, 246, 0, 248, 249, 250, 251, 252, 253, 0, 255}}, + {{128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 0, 165, 0, 167, 0, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 0, 181, 182, 183, 0, 185, 186, 187, 0, 0, 0, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255}} + }; + static const InvOvlCodeMap invOvlCodeMap; + + if (idx > NofCodePages || idx == 0) throw std::logic_error( "code page index not supported"); + m_cd = &codePage[ idx-1].ar[0]; + m_invcd = &invcodePage[ idx-1].ar[0]; + m_invovlcd = invOvlCodeMap.get( idx-1); + } + + /// \brief Get the unicode character representation of the character ch in this codepage + /// \param[in] ch character in this codepage + /// \return the unicode representation of the passed character + inline UChar ucharcode( char ch) const + { + if ((signed char)ch >= 0) return ch; + return m_cd[ (unsigned int)(unsigned char)ch - 128]; + } + + /// \brief Get the character representation of a unicode character in this codepage + /// \param[in] ch unicode character + /// \return the representation of the passed unicode character in this codepage + inline char invcode( UChar ch) const + { + char rt = 0; + if (ch <= 128) return ch; + if (ch <= 255) rt = m_invcd[ ch - 128]; + if (rt == 0) + { + std::map<unsigned short, unsigned char>::const_iterator fi = m_invovlcd->find( ch); + if (fi == m_invovlcd->end()) return 0; + rt = fi->second; + } + return rt; + } + + /// \brief Evaluate if two code pages are equal + static inline bool is_equal( const IsoLatinCodePage& a, const IsoLatinCodePage& b) + { + return a.m_cd == b.m_cd; + } + +private: + const unsigned short* m_cd; + const unsigned short* m_invcd; + const std::map<unsigned short, unsigned char>* m_invovlcd; +}; + +}} +#endif + + diff --git a/textwolf/include/textwolf/cstringiterator.hpp b/textwolf/include/textwolf/cstringiterator.hpp new file mode 100644 index 0000000..f2d5c12 --- /dev/null +++ b/textwolf/include/textwolf/cstringiterator.hpp @@ -0,0 +1,120 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/cstringiterator.hpp +/// \brief textwolf iterator on strings + +#ifndef __TEXTWOLF_CSTRING_ITERATOR_HPP__ +#define __TEXTWOLF_CSTRING_ITERATOR_HPP__ +#include <string> +#include <cstring> +#include <cstdlib> + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class CStringIterator +/// \brief Input iterator on a constant string returning null characters after EOF as required by textwolf scanners +class CStringIterator +{ +public: + /// \brief Default constructor + CStringIterator() + :m_src(0) + ,m_size(0) + ,m_pos(0){} + + /// \brief Constructor + /// \param [in] src null terminated C string to iterate on + /// \param [in] size number of bytes in the string to iterate on + CStringIterator( const char* src, unsigned int size) + :m_src(src) + ,m_size(size) + ,m_pos(0){} + + /// \brief Constructor + /// \param [in] src string to iterate on + CStringIterator( const char* src) + :m_src(src) + ,m_pos(0){m_size=std::strlen(m_src);} + + /// \brief Constructor + /// \param [in] src string to iterate on + CStringIterator( const std::string& src) + :m_src(src.c_str()) + ,m_size(src.size()) + ,m_pos(0){} + + /// \brief Copy constructor + /// \param [in] o iterator to copy + CStringIterator( const CStringIterator& o) + :m_src(o.m_src) + ,m_size(o.m_size) + ,m_pos(o.m_pos){} + + /// \brief Element access + /// \return current character + inline char operator* () + { + return (m_pos < m_size)?m_src[m_pos]:0; + } + + /// \brief Preincrement + inline CStringIterator& operator++() + { + m_pos++; + return *this; + } + + /// \brief Return current char position + inline unsigned int pos() const {return m_pos;} + + /// \brief Set current char position + inline void pos( unsigned int i) {m_pos=(i<m_size)?i:m_size;} + + inline int operator - (const CStringIterator& o) const + { + if (m_src != o.m_src) return 0; + return (int)(m_pos - o.m_pos); + } + +private: + const char* m_src; + unsigned int m_size; + unsigned int m_pos; +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/exception.hpp b/textwolf/include/textwolf/exception.hpp new file mode 100644 index 0000000..bf236fe --- /dev/null +++ b/textwolf/include/textwolf/exception.hpp @@ -0,0 +1,106 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/exception.hpp +/// \brief Definition of exceptions with containing error codes thrown by textwolf + +#ifndef __TEXTWOLF_EXCEPTION_HPP__ +#define __TEXTWOLF_EXCEPTION_HPP__ +#include <exception> +#include <stdexcept> + +namespace textwolf { + +/// \class throws_exception +/// \brief Base class for structures that can throw exceptions for non recoverable errors +struct throws_exception +{ + /// \enum Cause + /// \brief Enumeration of error cases + enum Cause + { + Unknown, ///< uknown error + DimOutOfRange, ///< memory reserved for statically allocated table or memory block is too small. Increase the size of memory block passed to the XML path select automaton. Usage error ! + StateNumbersNotAscending, ///< XML scanner automaton definition check failed. Labels of states must be equal to their indices. Internal textwold error ! + InvalidParamState, ///< parameter check (for state) in automaton definition failed. Internal textwold error ! + InvalidParamChar, ///< parameter check (for control character) in automaton definition failed. Internal textwold error ! + DuplicateStateTransition, ///< duplicate transition definition in automaton. Internal textwold error ! + InvalidState, ///< invalid state definition in automaton. Internal textwold error ! + IllegalParam, ///< parameter check in automaton definition failed. Internal textwold error ! + IllegalAttributeName, ///< invalid string for a tag or attribute in the automaton definition. Usage error ! + OutOfMem, ///< out of memory in the automaton definition. System error (std::bad_alloc) ! + ArrayBoundsReadWrite, ///< invalid array access. Internal textwold error ! + NotAllowedOperation ///< defining an operation in an automaton definition that is not allowed there. Usage error ! + }; +}; + +/// \class exception +/// \brief textwolf exception class +struct exception :public std::runtime_error +{ + typedef throws_exception::Cause Cause; + Cause cause; //< exception cause tag + + /// \brief Constructor + /// \return exception object + exception (Cause p_cause) throw() + :std::runtime_error("textwolf error in XML"), cause(p_cause) {} + /// \brief Copy constructor + exception (const exception& orig) throw() + :std::runtime_error("textwolf error in XML"), cause(orig.cause) {} + /// \brief Destructor + virtual ~exception() throw() {} + + /// \brief Assignement + /// \param[in] orig exception to copy + /// \return *this + exception& operator= (const exception& orig) throw() + {cause=orig.cause; return *this;} + + /// \brief Exception message + /// \return exception cause as string + virtual const char* what() const throw() + { + // enumeration of exception causes as strings + static const char* nameCause[ 12] = { + "Unknown","DimOutOfRange","StateNumbersNotAscending","InvalidParamState", + "InvalidParamChar","DuplicateStateTransition","InvalidState","IllegalParam", + "IllegalAttributeName","OutOfMem","ArrayBoundsReadWrite","NotAllowedOperation" + }; + return nameCause[ (unsigned int) cause]; + } +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/istreamiterator.hpp b/textwolf/include/textwolf/istreamiterator.hpp new file mode 100644 index 0000000..5a09669 --- /dev/null +++ b/textwolf/include/textwolf/istreamiterator.hpp @@ -0,0 +1,89 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/istreamiterator.hpp +/// \brief Definition of iterators for textwolf on STL input streams (std::istream) + +#ifndef __TEXTWOLF_ISTREAM_ITERATOR_HPP__ +#define __TEXTWOLF_ISTREAM_ITERATOR_HPP__ +#include <iostream> +#include <iterator> + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class IStreamIterator +/// \brief Input iterator on an STL input stream +class IStreamIterator +{ +public: + /// \brief Default constructor + IStreamIterator(){} + + /// \brief Constructor + /// \param [in] input input to iterate on + IStreamIterator( std::istream& input) + :m_itr(input) + { + input.unsetf( std::ios::skipws); + } + + /// \brief Copy constructor + /// \param [in] o iterator to copy + IStreamIterator( const IStreamIterator& o) + :m_itr(o.m_itr) + ,m_end(o.m_end){} + + /// \brief Element access + /// \return current character + inline char operator* () + { + return (m_itr != m_end)?*m_itr:0; + } + + /// \brief Pre increment + inline IStreamIterator& operator++() + { + ++m_itr; + return *this; + } + +private: + std::istream_iterator<unsigned char> m_itr; + std::istream_iterator<unsigned char> m_end; +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/sourceiterator.hpp b/textwolf/include/textwolf/sourceiterator.hpp new file mode 100644 index 0000000..98acfb5 --- /dev/null +++ b/textwolf/include/textwolf/sourceiterator.hpp @@ -0,0 +1,136 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/sourceiterator.hpp +/// \brief textwolf byte source iterator template + +#ifndef __TEXTWOLF_SOURCE_ITERATOR_HPP__ +#define __TEXTWOLF_SOURCE_ITERATOR_HPP__ +#include <cstdlib> +#include <stdexcept> +#include <setjmp.h> + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class SrcIterator +/// \brief Input iterator as source for the XML scanner with the possibility of being fed chunk by chunk +class SrcIterator +{ +public: + /// \brief Empty constructor + SrcIterator() + :m_itr(0) + ,m_end(0) + ,m_eom(0){} + + /// \brief Copy constructor + /// \param [in] o iterator to copy + SrcIterator( const SrcIterator& o) + :m_itr(o.m_itr) + ,m_end(o.m_end) + ,m_eom(o.m_eom){} + + /// \brief Constructor + /// \param [in] buf source chunk to iterate on + /// \param [in] size size of source chunk to iterate on in bytes + /// \param [in] eom_ trigger to activate if end of data has been reached (no next chunk anymore) + SrcIterator( const char* buf, std::size_t size, jmp_buf* eom_=0) + :m_itr(const_cast<char*>(buf)) + ,m_end(m_itr+size) + ,m_eom(eom_){} + + /// \brief Assingment operator + SrcIterator& operator=( const SrcIterator& o) + { + m_itr = o.m_itr; + m_end = o.m_end; + m_eom = o.m_eom; + return *this; + } + + /// \brief Element access operator (required by textwolf for an input iterator) + inline char operator*() + { + if (m_itr >= m_end) + { + if (m_eom) longjmp(*m_eom,1); + return 0; + } + return *m_itr; + } + + /// \brief Prefix increment operator (required by textwolf for an input iterator) + inline SrcIterator& operator++() + { + ++m_itr; + return *this; + } + + /// \brief Get the iterator difference in bytes + inline std::size_t operator-( const SrcIterator& b) const + { + if (b.m_end != m_end || m_itr < b.m_itr) throw std::logic_error( "illegal operation"); + return m_itr - b.m_itr; + } + + /// \brief Feed input to the source iterator + /// \param[in] buf poiner to start of input + /// \param[in] size size of input passed in bytes + /// \param[in] eom longjmp to call with parameter 1, if the end of data has been reached before EOF (null termination), eom=null, if the chunk passed contains the complete reset of the input and eof (null) can be returned if we reach the end + void putInput( const char* buf, std::size_t size, jmp_buf* eom=0) + { + m_itr = const_cast<char*>(buf); + m_end = m_itr+size; + m_eom = eom; + } + + /// \brief Get the current position in the current chunk parsed + /// \remark Does not return the absolute position in the source parsed but the position in the chunk + std::size_t getPosition() const + { + return (m_end >= m_itr)?(m_end-m_itr):0; + } + +private: + char* m_itr; + char* m_end; + jmp_buf* m_eom; +}; + +}//namespace +#endif + + diff --git a/textwolf/include/textwolf/staticbuffer.hpp b/textwolf/include/textwolf/staticbuffer.hpp new file mode 100644 index 0000000..8bbadb8 --- /dev/null +++ b/textwolf/include/textwolf/staticbuffer.hpp @@ -0,0 +1,179 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/staticbuffer.hpp +/// \brief Fixed size buffer fulfilling the requirement of a back insertion sequence needed for textwolf output + +#ifndef __TEXTWOLF_STATIC_BUFFER_HPP__ +#define __TEXTWOLF_STATIC_BUFFER_HPP__ +#include "textwolf/exception.hpp" +#include <cstddef> +#include <cstring> +#include <cstdlib> +#include <stdexcept> + +namespace textwolf { + +/// \class StaticBuffer +/// \brief Simple back insertion sequence for storing the outputs of textwolf in a contant size buffer +class StaticBuffer :public throws_exception +{ +public: + /// \brief Constructor + explicit StaticBuffer( std::size_t n) + :m_pos(0),m_size(n),m_ar(0),m_allocated(true) + { + m_ar = (char*)std::calloc( n, sizeof(char)); + if (!m_ar) throw std::bad_alloc(); + } + + /// \brief Constructor + StaticBuffer( char* p, std::size_t n, std::size_t i=0) + :m_pos(i) + ,m_size(n) + ,m_ar(p) + ,m_allocated(false) + ,m_overflow(false) {} + + /// \brief Copy constructor + StaticBuffer( const StaticBuffer& o) + :m_pos(o.m_pos) + ,m_size(o.m_size) + ,m_ar(0) + ,m_allocated(o.m_allocated) + ,m_overflow(o.m_overflow) + { + m_ar = (char*)std::malloc( m_size * sizeof(char)); + if (!m_ar) throw std::bad_alloc(); + std::memcpy( m_ar, o.m_ar, m_size); + } + + /// \brief Destructor + ~StaticBuffer() + { + if (m_allocated && m_ar) std::free(m_ar); + } + + /// \brief Clear the buffer content + void clear() + { + m_pos = 0; + m_overflow = false; + } + + /// \brief Append one character + /// \param[in] ch the character to append + void push_back( char ch) + { + if (m_pos < m_size) + { + m_ar[m_pos++] = ch; + } + else + { + m_overflow = true; + } + } + + /// \brief Append an array of characters + /// \param[in] cc the characters to append + /// \param[in] ccsize the number of characters to append + void append( const char* cc, std::size_t ccsize) + { + if (m_pos+ccsize > m_size) + { + m_overflow = true; + ccsize = m_size - m_pos; + } + std::memcpy( m_ar+m_pos, cc, ccsize); + m_pos += ccsize; + } + + /// \brief Return the number of characters in the buffer + /// \return the number of characters (bytes) + std::size_t size() const {return m_pos;} + + /// \brief Return the buffer content as 0-terminated string + /// \return the C-string + const char* ptr() const {return m_ar;} + + /// \brief Shrinks the size of the buffer or expands it with c + /// \param [in] n new size of the buffer + /// \param [in] c fill character if n bigger than the current fill size + void resize( std::size_t n, char c=0) + { + if (m_pos>n) + { + m_pos=n; + } + else + { + if (m_size<n) n=m_size; + while (n>m_pos) push_back(c); + } + } + + /// \brief random access of element + /// \param [in] ii + /// \return the character at this position + char operator []( std::size_t ii) const + { + if (ii > m_pos) throw exception( DimOutOfRange); + return m_ar[ii]; + } + + /// \brief random access of element reference + /// \param [in] ii + /// \return the reference to the character at this position + char& at( std::size_t ii) const + { + if (ii > m_pos) throw exception( DimOutOfRange); + return m_ar[ii]; + } + + /// \brief check for array bounds write + /// \return true if a push_back would have caused an array bounds write + bool overflow() const {return m_overflow;} +private: + StaticBuffer(){} ///< non copyable +private: + std::size_t m_pos; ///< current cursor position of the buffer (number of added characters) + std::size_t m_size; ///< allocation size of the buffer in bytes + char* m_ar; ///< buffer content + bool m_allocated; ///< true, if the buffer is allocated by this class and not passed by constructor + bool m_overflow; ///< true, if an array bounds write would have happened with push_back +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/textscanner.hpp b/textwolf/include/textwolf/textscanner.hpp new file mode 100644 index 0000000..21fa568 --- /dev/null +++ b/textwolf/include/textwolf/textscanner.hpp @@ -0,0 +1,225 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +#ifndef __TEXTWOLF_TEXT_SCANNER_HPP__ +#define __TEXTWOLF_TEXT_SCANNER_HPP__ +/// \file textwolf/textscanner.hpp +/// \brief Implementation of iterator for character-wise parsing of input + +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include <cstddef> + +namespace textwolf { + +/// \class TextScanner +/// \brief Reader for scanning the input character by character +/// \tparam Iterator source iterator type (implements preincrement and '*' input byte access indirection) +/// \tparam CharSet character set of the source stream +template <class Iterator, class CharSet> +class TextScanner +{ +private: + Iterator start; ///< source iterator start of current chunk + Iterator input; ///< source iterator + char buf[8]; ///< buffer for one character (the current character parsed) + UChar val; ///< Unicode character representation of the current character parsed + signed char cur; ///< ASCII character representation of the current character parsed or -1 if not in ASCII range + unsigned int state; ///< current state of the text scanner (byte position of iterator cursor in 'buf') + CharSet charset; + +public: + /// \class ControlCharMap + /// \brief Map of ASCII characters to control character identifiers used in the XML scanner automaton + struct ControlCharMap :public CharMap<ControlCharacter,Undef> + { + ControlCharMap() + { + (*this) + (0,EndOfText) + (1,31,Cntrl) + (5,Undef) + (33,127,Any) + (128,255,Undef) + ('\t',Space) + ('\r',Space) + ('\n',EndOfLine) + (' ',Space) + ('&',Amp) + ('<',Lt) + ('=',Equal) + ('>',Gt) + ('/',Slash) + ('-',Dash) + ('!',Exclam) + ('?',Questm) + ('\'',Sq) + ('\"',Dq) + ('[',Osb) + (']',Csb); + }; + }; + + /// \brief Constructor + TextScanner( const CharSet& charset_) + :val(0),cur(0),state(0),charset(charset_) + { + for (unsigned int ii=0; ii<sizeof(buf); ii++) buf[ii] = 0; + } + + TextScanner( const CharSet& charset_, const Iterator& p_iterator) + :start(p_iterator),input(p_iterator),val(0),cur(0),state(0),charset(charset_) + { + for (unsigned int ii=0; ii<sizeof(buf); ii++) buf[ii] = 0; + } + + TextScanner( const Iterator& p_iterator) + :start(p_iterator),input(p_iterator),val(0),cur(0),state(0),charset(CharSet()) + { + for (unsigned int ii=0; ii<sizeof(buf); ii++) buf[ii] = 0; + } + + /// \brief Copy constructor + /// \param [in] orig textscanner to copy + TextScanner( const TextScanner& orig) + :start(orig.start) + ,input(orig.input) + ,val(orig.val) + ,cur(orig.cur) + ,state(orig.state) + ,charset(orig.charset) + { + for (unsigned int ii=0; ii<sizeof(buf); ii++) buf[ii]=orig.buf[ii]; + } + + /// \brief Assign something to the iterator while keeping the state + /// \param [in] a source iterator assignment + template <class IteratorAssignment> + void setSource( const IteratorAssignment& a) + { + input = a; + start = a; + } + + /// \brief Get the current source iterator position + /// \return source iterator position in character words (usually bytes) + std::size_t getPosition() const + { + return input - start; + } + + /// \brief Get the unicode representation of the current character + /// \return the unicode character + inline UChar chr() + { + if (val == 0) + { + val = charset.value( buf, state, input); + } + return val; + } + + /// \brief Fill the internal buffer with as many current character bytes needed for reading the ASCII representation + inline void getcur() + { + cur = CharSet::asciichar( buf, state, input); + } + + /// \class copychar + /// \brief Direct copy of a character from input to output without encoding/decoding it + /// \remark Assumes the character set encodings to be of the same class + template <class Buffer> + inline void copychar( CharSet& output_, Buffer& buf_) + { + /// \todo more efficient solution of copy character to sink with same encoding here + /// \remark a check if the character sets fulfill is_equal(..) (IsoLatin code page !) + if (CharSet::is_equal( charset, output_)) + { + // ... if the character sets are equal and of the same subclass (code pages) + // then we do not decode/encode the character but copy it directly to the output + charset.fetchbytes( buf, state, input); + for (unsigned int ii=0; ii<state; ++ii) buf_.push_back(buf[ii]); + } + else + { + output_.print( chr(), buf_); + } + } + + /// \brief Get the control character representation of the current character + /// \return the control character + inline ControlCharacter control() + { + static ControlCharMap controlCharMap; + getcur(); + return controlCharMap[ (unsigned char)cur]; + } + + /// \brief Get the ASCII character representation of the current character + /// \return the ASCII character or 0 if not defined + inline unsigned char ascii() + { + getcur(); + return cur>=0?(unsigned char)cur:0; + } + + /// \brief Skip to the next character of the source + /// \return *this + inline TextScanner& skip() + { + CharSet::skip( buf, state, input); + state = 0; + cur = 0; + val = 0; + return *this; + } + + /// \brief see TextScanner::chr() + inline UChar operator*() + { + return chr(); + } + + /// \brief Preincrement: Skip to the next character of the source + /// \return *this + inline TextScanner& operator ++() {return skip();} + + /// \brief Postincrement: Skip to the next character of the source + /// \return *this + inline TextScanner operator ++(int) {TextScanner tmp(*this); skip(); return tmp;} +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/traits.hpp b/textwolf/include/textwolf/traits.hpp new file mode 100644 index 0000000..15a318d --- /dev/null +++ b/textwolf/include/textwolf/traits.hpp @@ -0,0 +1,65 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +#ifndef __TEXTWOLF_TRAITS_HPP__ +#define __TEXTWOLF_TRAITS_HPP__ +/// \file textwolf/traits.hpp +/// \brief Type traits + +namespace textwolf { +namespace traits { + +/// \class TypeCheck +/// \brief Test structure to stear the compiler +class TypeCheck +{ +public: + struct YES {}; + struct NO {}; + + template<typename T, typename U> + struct is_same + { + static const NO type() {return NO();} + }; + + template<typename T> + struct is_same<T,T> + { + static const YES type() {return YES();} + }; +}; + +}}//namespace +#endif diff --git a/textwolf/include/textwolf/xmlhdrparser.hpp b/textwolf/include/textwolf/xmlhdrparser.hpp new file mode 100644 index 0000000..f18b76b --- /dev/null +++ b/textwolf/include/textwolf/xmlhdrparser.hpp @@ -0,0 +1,411 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlhdrparser.hpp +/// \brief Class for parsing the header to get the character set encoding + +#ifndef __TEXTWOLF_XML_HEADER_PARSER_HPP__ +#define __TEXTWOLF_XML_HEADER_PARSER_HPP__ +#include <cstdlib> +#include "textwolf/sourceiterator.hpp" + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class XmlHdrParser +/// \brief Class for parsing the header to get the character set encoding +/// \remark Works with all single byte or multibyte character sets with ASCII as base +class XmlHdrParser +{ +public: + /// \brief Constructor + XmlHdrParser() + :m_state(Init) + ,m_attributetype(Encoding) + ,m_idx(0) + ,m_charsConsumed(0) + ,m_zeroCount(0){} + + /// \brief Copy constructor + /// \brief param[in] o object to copy + XmlHdrParser( const XmlHdrParser& o) + :m_state(o.m_state) + ,m_attributetype(o.m_attributetype) + ,m_idx(o.m_idx) + ,m_charsConsumed(o.m_charsConsumed) + ,m_zeroCount(o.m_zeroCount) + ,m_item(o.m_item) + ,m_src(o.m_src){} + + + /// \brief Add another input chunk to process + /// \param[in] src_ pointer to chunk + /// \param[in] srcsize_ size of chunk in bytes + void putInput( const char* src_, std::size_t srcsize_) + { + m_src.append( src_, srcsize_); + } + + /// \brief Get the whole original data added with subsequent calls of putInput(const char*,std::size_t) + /// \return the data block as string reference + const std::string& consumedData() const + { + return m_src; + } + + /// \brief Call the first/next iteration of parsing the header + /// \return true on success, false if more data is needed (putInput(const char*,std::size_t)) or if an error occurred. Check lasterror() for an error + bool parse() + { + unsigned char ch = nextChar(); + for (;ch != 0; ch = nextChar()) + { + switch (m_state) + { + case Init: + if (ch == '<') + { + m_state = ParseXmlOpen; + } + else if (ch <= 32) + { + continue; + } + else + { + setError( "expected open tag angle bracket '>'"); + return false; + } + break; + + case ParseXmlOpen: + if (ch == '?') + { + m_state = ParseXmlHdr; + } + else if (ch <= 32) + { + break; + } + else if (((ch|32) >= 'a' && (ch|32) <= 'z') || ch == '_') + { + return true; + } + else + { + setError( "expected xml header question mark '?' after open tag angle bracket '<'"); + return false; + } + break; + + case ParseXmlHdr: + if (ch <= 32 || ch == '?') + { + if (m_item != "xml") + { + setError( "expected '<?xml' as xml header start"); + return false; + } + m_item.clear(); + if (ch == '?') return true; /*...."<?xml?>"*/ + + m_state = FindAttributeName; + } + else if (((ch|32) >= 'a' && (ch|32) <= 'z') || ch == '_') + { + m_item.push_back(ch); + continue; + } + else if (ch == '>') + { + setError( "unexpected close angle bracket '>' in xml header after '<?xml'"); + return false; + } + else + { + setError( "expected '<?xml' as xml header start (invalid character)"); + return false; + } + break; + + case FindAttributeName: + if (ch <= 32) + { + continue; + } + else if (ch == '>' || ch == '?') + { + if (ch == '>') + { + setError( "unexpected close angle bracket '>' in xml header (missing '?')"); + return false; + } + return true; + } + else if (((ch|32) >= 'a' && (ch|32) <= 'z') || ch == '_') + { + m_item.push_back(ch); + m_state = ParseAttributeName; + } + else + { + setError( "invalid character in xml header attribute name"); + return false; + } + break; + case ParseAttributeName: + if (ch <= 32 || ch == '=') + { + if (m_item == "encoding") + { + m_attributetype = Encoding; + } + else if (m_item == "version") + { + m_attributetype = Version; + } + else if (m_item == "standalone") + { + m_attributetype = Standalone; + } + else + { + setError( "unknown xml header attribute name"); + return false; + } + m_item.clear(); + if (ch == '=') + { + m_state = FindAttributeValue; + continue; + } + m_state = FindAttributeAssign; + } + else if (((ch|32) >= 'a' && (ch|32) <= 'z') || ch == '_') + { + m_item.push_back(ch); + continue; + } + else + { + setError( "invalid character in xml header attribute name"); + return false; + } + break; + case FindAttributeAssign: + if (ch == '=') + { + m_state = FindAttributeValue; + } + else if (ch <= 32) + { + continue; + } + else + { + setError( "expected '=' after xml header attribute name"); + return false; + } + break; + case FindAttributeValue: + if (ch == '"') + { + m_state = ParseAttributeValueDq; + continue; + } + else if (ch == '\'') + { + m_state = ParseAttributeValueSq; + continue; + } + else if (ch <= 32) + { + continue; + } + else + { + setError( "expected single or double quote string as xml header attribute value"); + return false; + } + break; + case ParseAttributeValueSq: + if (ch == '\'') + { + switch (m_attributetype) + { + case Encoding: + m_encoding = m_item; + break; + case Version: + case Standalone: + break; + } + m_item.clear(); + m_state = FindAttributeName; + continue; + } + else + { + m_item.push_back( ch); + } + break; + case ParseAttributeValueDq: + if (ch == '\"') + { + switch (m_attributetype) + { + case Encoding: + m_encoding = m_item; + break; + case Version: + case Standalone: + break; + } + m_item.clear(); + m_state = FindAttributeName; + continue; + } + else + { + m_item.push_back( ch); + } + break; + }/*switch(..)*/ + }/*for(;..;..)*/ + return false; + } + + /// \brief Get the last error occurred + /// \return the pointer to the last error or NULL if no error occurred + const char* lasterror() const + { + return m_lasterror.empty()?0:m_lasterror.c_str(); + } + + /// \brief Get the encoding specified as attribute in the header + /// \return the encoding or NULL if not specified or not encountered yet in the source parsed + const char* encoding() const + { + return m_encoding.empty()?0:m_encoding.c_str(); + } + + /// \brief Get the number of ASCII characters consumed + /// \return the number of characters + std::size_t charsConsumed() const + { + return m_charsConsumed; + } + + /// \brief Clear the data, reset the state + void clear() + { + m_state = Init; + m_attributetype = Encoding; + m_idx = 0; + m_charsConsumed = 0; + m_zeroCount = 0; + m_item.clear(); + m_src.clear(); + m_encoding.clear(); + m_lasterror.clear(); + } + +private: + void setError( const std::string& m) + { + m_lasterror = m; + } + + unsigned char nextChar() + { + for (; m_zeroCount<4; m_zeroCount++) + { + if (m_idx >= m_src.size()) return 0; + unsigned char ch = m_src[m_idx]; + ++m_idx; + if (ch != 0) + { + m_zeroCount = 0; + if (ch > 32) + { + ++m_charsConsumed; + } + return ch; + } + } + throw std::runtime_error( "illegal XML header (more than 4 null bytes in a row)"); + } + + enum State + { + Init, + ParseXmlOpen, + ParseXmlHdr, + FindAttributeName, + ParseAttributeName, + FindAttributeAssign, + FindAttributeValue, + ParseAttributeValueSq, + ParseAttributeValueDq + }; + + enum AttributeType + { + Encoding, + Version, + Standalone + }; + + static const char* stateName( State i) + { + static const char* ar[] = {"Init","ParseXmlOpen","ParseXmlHdr","FindAttributeName","ParseAttributeName","FindAttributeAssign","FindAttributeValue","ParseAttributeValueSq","ParseAttributeValueDq"}; + return ar[ (int)i]; + } + +private: + State m_state; ///< header parsing state + AttributeType m_attributetype; ///< currently parsed attribute type + std::size_t m_idx; ///< source index (index in m_src) + std::size_t m_charsConsumed; ///< number of characters consumed + std::size_t m_zeroCount; ///< counter of subsequent null bytes + std::string m_item; ///< parsed item + std::string m_src; ///< source buffered + std::string m_encoding; ///< character set encoding parsed + std::string m_lasterror; ///< last error +}; + +}//namespace +#endif + diff --git a/textwolf/include/textwolf/xmlpathautomaton.hpp b/textwolf/include/textwolf/xmlpathautomaton.hpp new file mode 100644 index 0000000..9ce8896 --- /dev/null +++ b/textwolf/include/textwolf/xmlpathautomaton.hpp @@ -0,0 +1,778 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlpathautomaton.hpp +/// \brief Automaton to select path expressions from an XML iterator + +#ifndef __TEXTWOLF_XML_PATH_AUTOMATON_HPP__ +#define __TEXTWOLF_XML_PATH_AUTOMATON_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset.hpp" +#include "textwolf/exception.hpp" +#include "textwolf/xmlscanner.hpp" +#include "textwolf/staticbuffer.hpp" +#include <limits> +#include <sstream> +#include <string> +#include <vector> +#include <map> +#include <cstddef> +#include <stdexcept> + +namespace textwolf { + +///\class XMLPathSelectAutomaton +///\tparam CharSet_ character set of the token defintions of the automaton +///\brief Automaton to define XML path expressions and assign types (int values) to them +template <class CharSet_=charset::UTF8> +class XMLPathSelectAutomaton :public throws_exception +{ +public: + enum + { + defaultMemUsage=3*1024, //< default memory usage of the XML path select process, if not specified else + defaultMaxDepth=32 //< default max tag stack depth, if not specified else + }; + std::size_t memUsage; //< total memory usage + unsigned int maxDepth; //< max tag stack depth + std::size_t maxScopeStackSize; //< max scope stack depth + unsigned int maxFollows; //< maximum number of tokens searched in depth + unsigned int maxTriggers; //< maximum number of open triggers + unsigned int maxTokens; //< maximum number of open tokens + +public: + ///\brief Constructor + XMLPathSelectAutomaton() + :memUsage(defaultMemUsage) + ,maxDepth(defaultMaxDepth) + ,maxScopeStackSize(0) + ,maxFollows(0) + ,maxTriggers(0) + ,maxTokens(0) + { + if (!setMemUsage( memUsage, maxDepth)) throw exception( DimOutOfRange); + } + typedef CharSet_ CharSet; + typedef int Hash; + typedef XMLPathSelectAutomaton<CharSet> ThisXMLPathSelectAutomaton; + + virtual ~XMLPathSelectAutomaton(){} + +public: + ///\enum Operation + ///\brief Enumeration of operation types in the automaton definition + enum Operation + { + Content, //< searching content token + Tag, //< searching a tag + Attribute, //< searching an attribute + ThisAttributeValue, //< checking the value of the attribute just parsed (not an arbitrary but this one) + AttributeValue, //< searching a value of an attribute + ContentStart //< looking for the start of content (to signal the end of the XML header) + }; + + ///\brief Get the name of the operation as string + ///\return the operation as string + static const char* operationName( Operation op) + { + static const char* name[ 6] = {"Content", "Tag", "Attribute", "ThisAttributeValue", "AttributeValue", "ContentStart"}; + return name[ (unsigned int)op]; + } + + ///\class Mask + ///\brief Mask to query for element types, if they match or not + struct Mask + { + unsigned short pos; //< positively selected elements bitmask + unsigned short neg; //< negatively selected elements bitmask that determines when a search pattern is given up copletely + + ///\brief Tells if mask does not select anything anymore + ///\return true if it is not active anymore + bool empty() const {return (pos==0);} + + ///\brief Constructor by values + ///\param [in] p_pos positively selected elements bitmask + ///\param [in] p_neg negatively selected elements bitmask that determines when a search pattern is given up copletely + Mask( unsigned short p_pos=0, unsigned short p_neg=0):pos(p_pos),neg(p_neg) {} + + ///\brief Copy constructor + ///\param[in] orig mask to copy + Mask( const Mask& orig) :pos(orig.pos),neg(orig.neg) {} + + ///\brief Constructor by operation type + Mask( Operation op) :pos(0),neg(0) {this->match(op);} + + ///\brief Reset operation (deactivate) + void reset() {pos=0; neg=0;} + + ///\brief Deactivate operation for a certain element type + void reject( XMLScannerBase::ElementType e) {neg |= (1<<(unsigned short)e);} + bool hasReject( XMLScannerBase::ElementType e) const {return (neg & (1<<(unsigned short)e)) != 0;} + + ///\brief Declare an operation to match on an element type + void match( XMLScannerBase::ElementType e) {pos |= (1<<(unsigned short)e);} + bool hasMatch( XMLScannerBase::ElementType e) const {return (pos & (1<<(unsigned short)e)) != 0;} + + ///\brief Declare an operation as seek operation + void seekop( Operation op) + { + switch (op) + { + case Tag: + this->match( XMLScannerBase::OpenTag); + this->match( XMLScannerBase::HeaderStart); + break; + case Attribute: + this->match( XMLScannerBase::TagAttribName); + this->match( XMLScannerBase::HeaderAttribName); + this->reject( XMLScannerBase::Content); + break; + case ThisAttributeValue: + this->match( XMLScannerBase::TagAttribValue); + this->match( XMLScannerBase::HeaderAttribValue); + this->reject( XMLScannerBase::TagAttribName); + this->reject( XMLScannerBase::HeaderAttribName); + this->reject( XMLScannerBase::Content); + this->reject( XMLScannerBase::OpenTag); + break; + case AttributeValue: + this->match( XMLScannerBase::TagAttribValue); + this->match( XMLScannerBase::HeaderAttribValue); + this->reject( XMLScannerBase::Content); + break; + case Content: + this->match( XMLScannerBase::Content); + break; + case ContentStart: + this->match( XMLScannerBase::HeaderEnd); + break; + } + } + + const char* seekopName() const + { + if (this->hasMatch( XMLScannerBase::OpenTag) + && this->hasMatch( XMLScannerBase::HeaderStart)) + return "Tag"; + + if (this->hasMatch( XMLScannerBase::TagAttribName) + && this->hasMatch( XMLScannerBase::HeaderAttribName) + && this->hasReject( XMLScannerBase::Content)) + return "Attribute"; + + if (this->hasMatch( XMLScannerBase::TagAttribValue) + && this->hasMatch( XMLScannerBase::HeaderAttribValue) + && this->hasReject( XMLScannerBase::Content)) + return "AttributeValue"; + + if (this->hasMatch( XMLScannerBase::TagAttribValue) + && this->hasMatch( XMLScannerBase::HeaderAttribValue) + && this->hasReject( XMLScannerBase::TagAttribName) + && this->hasReject( XMLScannerBase::HeaderAttribName) + && this->hasReject( XMLScannerBase::Content) + && this->hasReject( XMLScannerBase::OpenTag)) + return "ThisAttributeValue"; + + if (this->hasMatch( XMLScannerBase::Content)) + return "Content"; + + if (this->hasMatch( XMLScannerBase::HeaderEnd)) + return "ContentStart"; + + if (pos == 0 && neg == 0) + return "None"; + + return ""; + } + + ///\brief Join two mask definitions + ///\param[in] mask definition of mask to join this with + void join( const Mask& mask) {pos |= mask.pos; neg |= mask.neg;} + + ///\brief Check if an element type matches the mask + ///\param[in] e element type to check + bool matches( XMLScannerBase::ElementType e) const {return (0 != (pos & (1<<(unsigned short)e)));} + + ///\brief Check if an element type should reset a mask + ///\param[in] e element type to check + bool rejects( XMLScannerBase::ElementType e) const {return (0 != (neg & (1<<(unsigned short)e)));} + }; + + ///\class Core + ///\brief Core of an automaton state definition that is used during XML processing + struct Core + { + Mask mask; //< mask definiting what tokens are matching this state + bool follow; //< true, if the state is seeking tokens in all follow scopes in the XML tree + int typeidx; //< type of the element emitted by this state on a match + int cnt_start; //< lower bound of the element index matching (for index ranges) + int cnt_end; //< upper bound of the element index matching (for index ranges) + + ///\brief Constructor + Core() :follow(false),typeidx(0),cnt_start(0),cnt_end(-1) {} + ///\brief Copy constructor + ///\param [in] o element to copy + Core( const Core& o) :mask(o.mask),follow(o.follow),typeidx(o.typeidx),cnt_start(o.cnt_start),cnt_end(o.cnt_end) {} + }; + + ///\class State + ///\brief State of an automaton in its definition + struct State + { + Core core; //< core of the state (the part used in processing) + unsigned int keysize; //< key size of the element + char* key; //< key of the element + char* srckey; //< key of the element as in source (for debugging or reporting, etc.) + int next; //< follow state + int link; //< alternative state to check + + ///\brief Constructor + State() + :keysize(0),key(0),srckey(0),next(-1),link(-1) {} + + ///\brief Copy constructor + ///\param [in] orig element to copy + State( const State& orig) + :core(orig.core),keysize(orig.keysize),key(0),srckey(0),next(orig.next),link(orig.link) + { + defineKey( orig.keysize, orig.key, orig.srckey); + } + + ///\brief Destructor + ~State() + { + if (key) delete [] key; + if (srckey) delete [] srckey; + } + + ///\brief Check it the state definition is empty + ///\return true for an empty state + bool isempty() {return key==0&&core.typeidx==0;} + + ///\brief Define the matching key of this state + ///\param[in] p_keysize size of the key in bytes + ///\param[in] p_key pointer to the key + ///\param[in] p_srckey the source form of the key (ASCII with encoded entities for everything else) + void defineKey( unsigned int p_keysize, const char* p_key, const char* p_srckey) + { + unsigned int ii; + if (key) + { + delete [] key; + key = 0; + } + if (srckey) + { + delete [] srckey; + srckey = 0; + } + if (p_key) + { + key = new char[ keysize=p_keysize]; + for (ii=0; ii<keysize; ii++) key[ii]=p_key[ii]; + } + if (p_srckey) + { + for (ii=0; p_srckey[ii]!=0; ii++); + srckey = new char[ ii+1]; + for (ii=0; p_srckey[ii]!=0; ii++) srckey[ii]=p_srckey[ii]; + srckey[ ii] = 0; + } + } + + ///\brief Define a state transition by key and operation + ///\param[in] op operation type + ///\param[in] p_keysize size of the key in bytes + ///\param[in] p_key pointer to the key + ///\param[in] p_srckey the source form of the key (ASCII with encoded entities for everything else) + ///\param[in] p_next follow state on a match + ///\param[in] p_follow true if the search reaches all included follow scopes of the definition scope + void defineNext( Operation op, unsigned int p_keysize, const char* p_key, const char* p_srckey, int p_next, bool p_follow=false) + { + core.mask.seekop( op); + defineKey( p_keysize, p_key, p_srckey); + next = p_next; + core.follow = p_follow; + } + + ///\brief Define an element output operation + ///\param[in] mask mask defining the element types to output + ///\param[in] p_typeidx the type of the element produced + ///\param[in] p_follow true if the output reaches all included follow scopes of the definition scope + ///\param[in] p_start start index of the element range produced + ///\param[in] p_end upper bound index of the element range produced + void defineOutput( const Mask& mask, int p_typeidx, bool p_follow, int p_start, int p_end) + { + core.mask = mask; + core.typeidx = p_typeidx; + core.cnt_end = p_end; + core.cnt_start = p_start; + core.follow = p_follow; + } + + ///\brief Link another state to check to the current state + ///\param[in] p_link the index of the state to link + void defLink( int p_link) + { + link = p_link; + } + + std::string tostring() const + { + std::ostringstream rt; + if (next >= 0) rt << " ->" << next; + if (link >= 0) rt << " ~" << link; + rt << ' '; + if (core.follow) + { + rt << '/'; + } + rt << '/'; + rt << core.mask.seekopName(); + if (srckey) + { + rt << " '" << srckey << "'"; + } + else + { + rt << " (null)"; + } + if (core.cnt_end > 0) + { + rt << '[' << core.cnt_start << ',' << rt << core.cnt_end << ']'; + } + if (core.typeidx) + { + rt << " =>" << core.typeidx; + } + return rt.str(); + } + }; + std::vector<State> states; //< the states of the statemachine + + ///\brief Returns the content of the automaton as pretty printed string for debug output + std::string tostring() const + { + std::ostringstream rt; + typename std::vector<State>::const_iterator ii=states.begin(), ee=states.end(); + for (; ii != ee; ++ii) + { + rt << (int)(ii-states.begin()) << ": " << ii->tostring() << std::endl; + } + return rt.str(); + } + + ///\class Token + ///\brief Active or passive but still valid token of the XML processing (this is a trigger waiting to match) + struct Token + { + Core core; //< core of the state + int stateidx; //< index into the automaton, poiting to the state + + ///\brief Constructor + Token() :stateidx(-1) {} + ///\brief Copy constructor + Token( const Token& orig) :core(orig.core),stateidx(orig.stateidx) {} + ///\brief Constructor by value + ///\param [in] state state that generated this token + ///\param [in] p_stateidx index of the state that generated this token + Token( const State& state, int p_stateidx) :core(state.core),stateidx(p_stateidx) {} + }; + + ///\class Scope + ///\brief Tag scope definition + struct Scope + { + Mask mask; //< joined mask of all tokens active in this scope + Mask followMask; //< joined mask of all tokens active in this and all sub scopes of this scope + + ///\class Range + ///\brief Range on the token stack with all tokens that belong to this scope + struct Range + { + unsigned int tokenidx_from; //< lower bound token index + unsigned int tokenidx_to; //< upper bound token index + unsigned int followidx; //< pointer to follow token stack with tokens active in this and all sub scopes of this scope + + ///\brief Constructor + Range() :tokenidx_from(0),tokenidx_to(0),followidx(0) {} + ///\brief Copy constructor + ///\param[in] orig scope to copy + Range( const Scope& orig) :tokenidx_from(orig.tokenidx_from),tokenidx_to(orig.tokenidx_to),followidx(orig.followidx) {} + }; + Range range; //< valid (active) token range of this scope (on the token stacks) + + ///\brief Copy constructor + ///\param[in] orig scope to copy + Scope( const Scope& orig) :mask(orig.mask),followMask(orig.followMask),range(orig.range) {} + ///\brief Assignement operator + ///\param[in] orig scope to copy + Scope& operator =( const Scope& orig) {mask=orig.mask; followMask=orig.followMask; range=orig.range; return *this;} + ///\brief Constructor + Scope() {} + }; + + ///\brief Defines the usage of memory + ///\param [in] p_memUsage size of the memory block in bytes + ///\param [in] p_maxDepth maximum depht of the scope stack + ///\return true, if everything is OK + bool setMemUsage( std::size_t p_memUsage, unsigned int p_maxDepth) + { + memUsage = p_memUsage; + maxDepth = p_maxDepth; + maxScopeStackSize = maxDepth; + if (p_memUsage < maxScopeStackSize * sizeof(Scope)) + { + maxScopeStackSize = 0; + } + else + { + p_memUsage -= maxScopeStackSize * sizeof(Scope); + } + maxFollows = (p_memUsage / sizeof(std::size_t)) / 32 + 2; + maxTriggers = (p_memUsage / sizeof(std::size_t)) / 32 + 3; + p_memUsage -= sizeof(std::size_t) * maxFollows + sizeof(std::size_t) * maxTriggers; + maxTokens = p_memUsage / sizeof(Token); + return (maxScopeStackSize != 0 && maxTokens != 0 && maxFollows != 0 && maxTriggers != 0); + } + +private: + ///\brief Defines a state transition + ///\param [in] stateidx from what source state + ///\param [in] op operation firing the state transition + ///\param [in] keysize length of the key firing the state transition in bytes + ///\param [in] key the key string firing the state transition in bytes + ///\param [in] srckey the ASCII encoded representation in the source + ///\param [in] follow true, uf the state transition is active for all sub scopes of the activation state + ///\return the target state of the transition defined + int defineNext( int stateidx, Operation op, unsigned int keysize, const char* key, const char* srckey, bool follow=false) throw(exception) + { + try + { + State state; + if (states.size() == 0) + { + stateidx = states.size(); + states.push_back( state); + } + for (int ee=stateidx; ee != -1; stateidx=ee,ee=states[ee].link) + { + if (states[ee].key != 0 && keysize == states[ee].keysize && states[ee].core.follow == follow) + { + unsigned int ii; + for (ii=0; ii<keysize && states[ee].key[ii]==key[ii]; ii++); + if (ii == keysize) return states[ee].next; + } + } + if (!states[stateidx].isempty()) + { + stateidx = states[stateidx].link = states.size(); + states.push_back( state); + } + states.push_back( state); + unsigned int lastidx = states.size()-1; + states[ stateidx].defineNext( op, keysize, key, srckey, lastidx, follow); + return stateidx=lastidx; + } + catch (std::bad_alloc) + { + throw exception( OutOfMem); + } + catch (...) + { + throw exception( Unknown); + } + } + + ///\brief Defines an output print action and output type for a state + ///\param [in] stateidx from what source state + ///\param [in] printOpMask mask for elements printed + ///\param [in] typeidx type identifier + ///\param [in] follow true, uf the state transition is active for all sub scopes of the activation state + ///\param [in] start start of index range where this state transition fires + ///\param [in] end end of index range where this state transition fires + ///\return index of the state where this output action was defined + int defineOutput( int stateidx, const Mask& printOpMask, int typeidx, bool follow, int start, int end) throw(exception) + { + try + { + State state; + if (states.size() == 0) + { + stateidx = states.size(); + states.push_back( state); + } + if ((unsigned int)stateidx >= states.size()) throw exception( IllegalParam); + + if (!states[stateidx].isempty()) + { + stateidx = states[stateidx].link = states.size(); + states.push_back( state); + } + states[ stateidx].defineOutput( printOpMask, typeidx, follow, start, end); + return stateidx; + } + catch (std::bad_alloc) + { + throw exception( OutOfMem); + } + catch (...) + { + throw exception( Unknown); + } + } + +public: + ///\class PathElement + ///\brief Defines one node in the XML Path element tree in the construction phase. + ///\remark This is just a construct for building the tree with cascading operators forming a path representation + struct PathElement :throws_exception + { + private: + XMLPathSelectAutomaton* xs; //< XML Path select automaton where this node is an element of + int stateidx; //< state of this element in the automaton + + ///\class Range + ///\brief Element counting range defining what are indices of valid elements + struct Range + { + int start; //< index of starting element starting with 0 + int end; //< index of upper boundary element (not belonging to range anymore). -1 if undefined (unlimited) + + ///\brief Copy constructor + ///\param [in] o range element to copy + Range( const Range& o) :start(o.start),end(o.end){} + ///\brief Constructor by value + ///\param [in] p_start index of starting element + ///\param [in] p_end index of upper boundary element (not belonging to range anymore). -1 if undefined (unlimited) + Range( int p_start, int p_end) :start(p_start),end(p_end){} + ///\brief Constructor by value + ///\param [in] count number of elements starting with the first one (with index 0) + Range( int count) :start(0),end(count){} + ///\brief Constructor + Range() :start(0),end(-1){} + }; + Range range; //< Index range of this XML path element + bool follow; //< true, if this element is active (firing) for all sub scopes of the activation scope + Mask pushOpMask; //< mask for firing element actions + Mask printOpMask; //< mask for printing element actions + + private: + ///\brief Define an output operation for a certain element type in this state + ///\param [in] op XML operation type of this output + ///\return *this + PathElement& defineOutput( Operation op) + { + printOpMask.reset(); + printOpMask.seekop( op); + return *this; + } + + ///\brief Define a state transition operation for a token of a certain element type in this state + ///\param [in] op XML operation type of this state transition + ///\param [in] value key value as ASCII with encoded entities for higher unicode characters of this state transition + ///\return *this + PathElement& doSelect( Operation op, const char* value) throw(exception) + { + static XMLScannerBase::IsTagCharMap isTagCharMap; + if (xs != 0) + { + if (value) + { + char buf[ 1024]; + StaticBuffer pb( buf, sizeof(buf)); + char* itr = const_cast<char*>(value); + typedef XMLScanner<char*,CharSet,CharSet,StaticBuffer> StaticXMLScanner; + if (!StaticXMLScanner::parseStaticToken( isTagCharMap, itr, pb)) + { + throw exception( IllegalAttributeName); + } + stateidx = xs->defineNext( stateidx, op, pb.size(), pb.ptr(), value, follow); + } + else + { + stateidx = xs->defineNext( stateidx, op, 0, 0, 0, follow); + } + } + return *this; + } + + ///\brief Define this element as active (firing,printing) for all sub scopes of the activation scope + ///\return *this + PathElement& doFollow() + { + follow = true; + return *this; + } + + ///\brief Define a valid range of token count for this element to be active + ///\param [in] p_start index of starting element starting with 0 + ///\param [in] p_end index of upper boundary element (not belonging to range anymore). -1 if undefined (unlimited) + ///\return *this + PathElement& doRange( int p_start, int p_end) + { + if (range.end == -1) + { + range = Range( p_start, p_end); + } + else if (p_end < range.end) + { + range.end = p_end; + } + else if (p_start > range.start) + { + range.start = p_start; + } + return *this; + } + + ///\brief Define a valid range of token count for this element to be active by the number of elements + ///\param [in] p_count number of elements starting with 0 + ///\return *this + PathElement& doCount( int p_count) + { + return doRange( 0, p_count); + } + + ///\brief Define the start of the range of token count for this element to be active + ///\param [in] p_start index of starting element starting with 0 + ///\return *this + PathElement& doStart( int p_start) + { + return doRange( p_start, std::numeric_limits<int>::max()); + } + + ///\brief Define the output of the current element + ///\param [in] typeidx type of the element produced + ///\return *this + PathElement& push( int typeidx) throw(exception) + { + if (xs != 0) stateidx = xs->defineOutput( stateidx, printOpMask, typeidx, follow, range.start, range.end); + return *this; + } + + public: + ///\brief Constructor + PathElement() :xs(0),stateidx(0),follow(false),pushOpMask(0),printOpMask(0){} + ///\brief Constructor by values + ///\param [in] p_xs automaton of this element + ///\param [in] p_si state index of this element in the automaton definition + PathElement( XMLPathSelectAutomaton* p_xs, int p_si=0) :xs(p_xs),stateidx(p_si),follow(false),pushOpMask(0),printOpMask(0){} + ///\brief Copy constructor + ///\param [in] orig element to copy + PathElement( const PathElement& orig) :xs(orig.xs),stateidx(orig.stateidx),range(orig.range),follow(orig.follow),pushOpMask(orig.pushOpMask),printOpMask(orig.printOpMask) {} + + ///\brief Corresponds to "//" in abbreviated syntax of XPath + ///\return *this + PathElement& operator --(int) {return doFollow();} + ///\brief Find tag by name + ///\param [in] name name of the tag + ///\return *this + ///\remark same as selectTag(const char*) + PathElement& operator []( const char* name) throw(exception) {return doSelect( Tag, name);} + ///\brief Find tag by name + ///\param [in] name name of the tag + ///\return *this + PathElement& selectTag( const char* name) throw(exception) {return doSelect( Tag, name);} + + ///\brief Find tag with one attribute + ///\param [in] name name of the attribute + ///\return *this + ///\remark same as selectAttribute(const char*) + PathElement& operator ()( const char* name) throw(exception) {return doSelect( Attribute, name).defineOutput( ThisAttributeValue);} + ///\brief Find tag with one attribute + ///\param [in] name name of the attribute + ///\return *this + PathElement& selectAttribute( const char* name) throw(exception) {return doSelect( Attribute, name).defineOutput( ThisAttributeValue);} + + ///\brief Find tag with one attribute,value condition + ///\remark same as ifAttribute(const char*,const char*) + ///\param [in] name name of the attribute + ///\param [in] value value of the attribute + ///\return *this + PathElement& operator ()( const char* name, const char* value) throw(exception) {return doSelect( Attribute, name).doSelect( ThisAttributeValue, value);} + + ///\brief Find tag with one attribute,value condition + ///\param [in] name name of the attribute + ///\param [in] value value of the attribute + ///\return *this + PathElement& ifAttribute( const char* name, const char* value) throw(exception) {return doSelect( Attribute, name).doSelect( ThisAttributeValue, value);} + + ///\brief Define maximum element index to push + ///\param [in] idx maximum element index + ///\return *this + PathElement& TO(int idx) throw(exception) {return doCount((idx>=0)?(idx+1):-1);} + ///\brief Define minimum element index to push + ///\param [in] idx minimum element index + ///\return *this + PathElement& FROM(int idx) throw(exception) {return doStart(idx); return *this;} + ///\brief Define minimum and maximum element index to push + ///\param [in] idx1 minimum element index + ///\param [in] idx2 maximum element index + ///\return *this + PathElement& RANGE(int idx1, int idx2) throw(exception) {return doRange(idx1,(idx2>=0)?(idx2+1):-1); return *this;} + ///\brief Define index of the element index to push + ///\param [in] idx element index + ///\return *this + PathElement& INDEX(int idx) throw(exception) {return doRange(idx,idx+1); return *this;} + + ///\brief Define element type to push + ///\param [in] type element type + ///\return *this + ///\remark same as assignType(int) + PathElement& operator =(int type) throw(exception) {return push( type);} + ///\brief Define element type to push + ///\param [in] type element type + ///\return *this + PathElement& assignType(int type) throw(exception) {return push( type);} + + ///\brief Define grab content + ///\remark same as selectContent() + ///\return *this + PathElement& operator ()() throw(exception) {return defineOutput(Content);} + ///\brief Define grab content + ///\return *this + PathElement& selectContent() throw(exception) {return defineOutput(Content);} + }; + + ///\brief Get automaton root element to start an XML path definition + ///\return the automaton root element + PathElement operator*() + { + return PathElement( this); + } +}; + +} //namespace +#endif diff --git a/textwolf/include/textwolf/xmlpathautomatonparse.hpp b/textwolf/include/textwolf/xmlpathautomatonparse.hpp new file mode 100644 index 0000000..33856de --- /dev/null +++ b/textwolf/include/textwolf/xmlpathautomatonparse.hpp @@ -0,0 +1,245 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlpathautomatonparse.hpp +/// \brief Parser to create a path expression selector automaton from a source (list of path expression in abbreviated syntax of xpath) + +#ifndef __TEXTWOLF_XML_PATH_AUTOMATON_PARSE_HPP__ +#define __TEXTWOLF_XML_PATH_AUTOMATON_PARSE_HPP__ +#include "textwolf/xmlpathautomaton.hpp" +#include "textwolf/charset.hpp" +#include "textwolf/cstringiterator.hpp" +#include <limits> +#include <string> +#include <vector> +#include <cstring> +#include <cstddef> +#include <stdexcept> + +namespace textwolf { + +///\class XMLPathSelectAutomatonParser +///\tparam SrcCharSet character set of the automaton definition source +///\tparam AtmCharSet character set of the token defintions of the automaton +///\brief Automaton to define XML path expressions and assign types (int values) to them +template <class SrcCharSet=charset::UTF8, class AtmCharSet=charset::UTF8> +class XMLPathSelectAutomatonParser :public XMLPathSelectAutomaton<AtmCharSet> +{ +public: + typedef XMLPathSelectAutomaton<AtmCharSet> ThisAutomaton; + typedef typename ThisAutomaton::PathElement PathElement; + typedef XMLPathSelectAutomatonParser This; + typedef TextScanner<CStringIterator,SrcCharSet> SrcScanner; + +public: + ///\brief Constructor + XMLPathSelectAutomatonParser(){} + virtual ~XMLPathSelectAutomatonParser(){} + + int addExpression( int typeidx, const char* esrc, std::size_t esrcsize) + { + std::string idstrings; + CStringIterator pitr( esrc, esrcsize); + SrcScanner pp( m_srccharset, pitr); + std::vector<std::size_t> idref; + + for (; *pp; skipSpaces( pp)) + { + switch (*pp) + { + case '/': + case '@': + ++pp; + continue; + case '[': + while (*pp != 0 && *pp != ']') pp++; + if (*pp == 0) return pp.getPosition()+1; + ++pp; + continue; + default: + if (pp.control() == Undef || pp.control() == Any) + { + idref.push_back( parseIdentifier( pp, idstrings)); + } + else + { + return pp.getPosition()+1; + } + } + } + typename std::vector<std::size_t>::const_iterator di = idref.begin(), de = idref.end(); + + CStringIterator itr( esrc, esrcsize); + SrcScanner src( m_srccharset, itr); + PathElement expr( this); + + for (; *src; skipSpaces( src)) + { + switch (*src) + { + case '@': + { + if (di == de) return src.getPosition()+1; + ++src; + skipIdentifier( src); + expr( getIdentifier( *di, idstrings) ); + } + case '/': + { + ++src; + if (*src == '/') + { + ++src; + if (*src == '@') + { + if (di == de) return src.getPosition()+1; + ++src; + skipIdentifier( src); + expr -- ( getIdentifier( *di, idstrings) ); + } + else + { + if (di == de) return src.getPosition()+1; + skipIdentifier( src); + expr -- [ getIdentifier( *di, idstrings) ]; + } + } + else + { + if (*src == '@') + { + if (di == de) return src.getPosition()+1; + ++src; + skipIdentifier( src); + expr ( getIdentifier( *di, idstrings) ); + } + else + { + if (di == de) return src.getPosition()+1; + skipIdentifier( src); + expr [ getIdentifier( *di, idstrings) ]; + } + } + continue; + } + case '[': + { + // Range + int range_start = -1; + int range_end = -1; + ++src; skipSpaces( src); + range_start = parseNum( src); + if (range_start < 0) return src.getPosition()+1; + skipSpaces( src); + + if (*src == ',') + { + ++src; skipSpaces( src); + if (*src == ']') + { + expr.FROM( range_start); + ++src; + } + else + { + range_end = parseNum( src); + if (range_end < 0) return src.getPosition()+1; + ++src; skipSpaces( src); + if (*src != ']') return src.getPosition()+1; + expr.RANGE( range_start, range_end); + ++src; + } + } + else if (*src == ']') + { + range_start = range_end; + expr.INDEX( range_start); + ++src; + } + else + { + return src.getPosition()+1; + } + continue; + } + default: + return src.getPosition()+1; + } + } + expr.assignType( typeidx); + return 0; + } + +private: + static void skipSpaces( SrcScanner& src) + { + for (; src.control() == Space; ++src); + } + + static int parseNum( SrcScanner& src) + { + std::string num; + for (; *src>='0' && *src<='9';++src) num.push_back( *src); + if (num.size() == 0 || num.size() > 8) return -1; + return std::atoi( num.c_str()); + } + + std::size_t parseIdentifier( SrcScanner& src, std::string& idstrings) + { + std::size_t rt = idstrings.size(); + for (; src.control() == Undef || src.control() == Any; ++src) + { + m_atmcharset.print( *src, idstrings); + } + m_atmcharset.print( 0, idstrings); + return rt; + } + + static void skipIdentifier( SrcScanner& src) + { + for (; src.control() == Undef || src.control() == Any; ++src); + } + + const char* getIdentifier( std::size_t idx, const std::string& idstrings) const + { + return idstrings.c_str() + idx; + } + +private: + AtmCharSet m_atmcharset; + SrcCharSet m_srccharset; +}; + +} //namespace +#endif diff --git a/textwolf/include/textwolf/xmlpathselect.hpp b/textwolf/include/textwolf/xmlpathselect.hpp new file mode 100644 index 0000000..a57969e --- /dev/null +++ b/textwolf/include/textwolf/xmlpathselect.hpp @@ -0,0 +1,516 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlpathselect.hpp +/// \brief Context of running automaton selecting path expressions from an XML iterator + +#ifndef __TEXTWOLF_XML_PATH_SELECT_HPP__ +#define __TEXTWOLF_XML_PATH_SELECT_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include "textwolf/xmlscanner.hpp" +#include "textwolf/staticbuffer.hpp" +#include "textwolf/xmlpathautomaton.hpp" +#include <limits> +#include <string> +#include <vector> +#include <map> +#include <cstddef> + +namespace textwolf { + +/// \brief XML path select template +/// \tparam CharSet_ character set encoding of the automaton elements +template <class CharSet_> +class XMLPathSelect :public throws_exception +{ +public: + typedef XMLPathSelectAutomaton<CharSet_> ThisXMLPathSelectAutomaton; + typedef XMLPathSelect<CharSet_> ThisXMLPathSelect; + +private: + const ThisXMLPathSelectAutomaton* atm; //< XML select automaton + typedef typename ThisXMLPathSelectAutomaton::Mask Mask; + typedef typename ThisXMLPathSelectAutomaton::Token Token; + typedef typename ThisXMLPathSelectAutomaton::Hash Hash; + typedef typename ThisXMLPathSelectAutomaton::State State; + typedef typename ThisXMLPathSelectAutomaton::Scope Scope; + + /// \class Array + /// \brief static array of POD types. I decided to implement it on my own though using boost::array would maybe be better. + /// \tparam Element element type of the array + template <typename Element> + class Array :public throws_exception + { + Element* m_ar; //< pointer to elements + std::size_t m_size; //< fill size (number of elements inserted) + std::size_t m_maxSize; //< allocation size (space reserved for this number of elements) + public: + /// \brief Constructor + /// \param [in] p_maxSize allocation size (number of elements) to reserve + Array( std::size_t p_maxSize) :m_size(0),m_maxSize(p_maxSize) + { + m_ar = new (std::nothrow) Element[ m_maxSize]; + if (m_ar == 0) throw exception( OutOfMem); + } + + /// \brief Destructor + ~Array() + { + if (m_ar) delete [] m_ar; + } + + /// \brief Append one element + /// \param [in] elem element to append + void push_back( const Element& elem) + { + if (m_size == m_maxSize) throw exception( OutOfMem); + m_ar[ m_size++] = elem; + } + + /// \brief Remove one element from the end + void pop_back() + { + if (m_size == 0) throw exception( NotAllowedOperation); + m_size--; + } + + /// \brief Access element by index + /// \param [in] idx index of the element starting with 0 + /// \return element reference + Element& operator[]( std::size_t idx) + { + if (idx >= m_size) throw exception( ArrayBoundsReadWrite); + return m_ar[ idx]; + } + + /// \brief Get a reference of the element at the end of the array + /// \return element reference + Element& back() + { + if (m_size == 0) throw exception( ArrayBoundsReadWrite); + return m_ar[ m_size-1]; + } + + /// \brief Resize of the array + /// \param [in] p_size new array size + void resize( std::size_t p_size) + { + if (p_size > m_size) throw exception( ArrayBoundsReadWrite); + m_size = p_size; + } + std::size_t size() const {return m_size;} + bool empty() const {return m_size==0;} + }; + + /// \class Context + /// \brief State variables without stacks of the automaton + struct Context + { + XMLScannerBase::ElementType type; //< element type processed + const char* key; //< string value of element processed + unsigned int keysize; //< size of string value in bytes of element processed + Scope scope; //< active scope + unsigned int scope_iter; //< position of currently visited token in the active scope + + /// \brief Constructor + Context() :type(XMLScannerBase::Content),key(0),keysize(0) {} + + /// \brief Initialization + /// \param [in] p_type type of the current element processed + /// \param [in] p_key current element processed + /// \param [in] p_keysize size of the key in bytes + void init( XMLScannerBase::ElementType p_type, const char* p_key, int p_keysize) + { + type = p_type; + key = p_key; + keysize = p_keysize; + scope_iter = scope.range.tokenidx_from; + } + }; + + Array<Scope> scopestk; //< stack of scopes opened + Array<unsigned int> follows; //< indices of tokens active in all descendant scopes + Array<int> triggers; //< triggered elements + Array<Token> tokens; //< list of waiting tokens + Context context; //< state variables without stacks of the automaton + + /// \brief Activate a state by index + /// \param stateidx index of the state to activate + void expand( int stateidx) + { + while (stateidx!=-1) + { + const State& st = atm->states[ stateidx]; + context.scope.mask.join( st.core.mask); + if (st.core.mask.empty() && st.core.typeidx != 0) + { + triggers.push_back( st.core.typeidx); + } + else + { + if (st.core.follow) + { + context.scope.followMask.join( st.core.mask); + follows.push_back( tokens.size()); + } + tokens.push_back( Token( st, stateidx)); + } + stateidx = st.link; + } + } + + /// \brief Declares the currently processed element of the XMLScanner input. By calling fetch we get the output elements from it + /// \param [in] type type of the current element processed + /// \param [in] key current element processed + /// \param [in] keysize size of the key in bytes + void initProcessElement( XMLScannerBase::ElementType type, const char* key, int keysize) + { + if (context.type == XMLScannerBase::OpenTag) + { + //last step of open scope has to be done after all tokens were visited, + //e.g. with the next element initialization + context.scope.range.tokenidx_from = context.scope.range.tokenidx_to; + } + context.scope.range.tokenidx_to = tokens.size(); + context.scope.range.followidx = follows.size(); + context.init( type, key, keysize); + + if (type == XMLScannerBase::OpenTag) + { + //first step of open scope saves the context context on stack + scopestk.push_back( context.scope); + context.scope.mask = context.scope.followMask; + context.scope.mask.match( XMLScannerBase::OpenTag); + //... we reset the mask but ensure that this 'OpenTag' is processed for sure + } + else if (type == XMLScannerBase::CloseTag || type == XMLScannerBase::CloseTagIm) + { + if (!scopestk.empty()) + { + context.scope = scopestk.back(); + scopestk.pop_back(); + follows.resize( context.scope.range.followidx); + tokens.resize( context.scope.range.tokenidx_to); + } + } + } + + /// \brief produce an element adressed by token index + /// \param [in] tokenidx index of the token in the list of active tokens + /// \param [in] st state from which the expand was triggered + void produce( unsigned int tokenidx, const State& st) + { + const Token& tk = tokens[ tokenidx]; + if (tk.core.cnt_end == -1) + { + expand( st.next); + } + else + { + if (tk.core.cnt_end > 0) + { + if (--tokens[ tokenidx].core.cnt_end == 0) + { + tokens[ tokenidx].core.mask.reset(); + } + if (tk.core.cnt_start <= 0) + { + expand( st.next); + } + else + { + --tokens[ tokenidx].core.cnt_start; + } + } + } + } + + /// \brief check if an active token addressed by index matches to the currently processed element + /// \param [in] tokenidx index of the token in the list of active tokens + /// \return matching token type + int match( unsigned int tokenidx) + { + int rt = 0; + if (context.key != 0) + { + if (tokenidx >= context.scope.range.tokenidx_to) return 0; + + const Token& tk = tokens[ tokenidx]; + if (tk.core.mask.matches( context.type)) + { + const State& st = atm->states[ tk.stateidx]; + if (st.key) + { + if (st.keysize == context.keysize) + { + unsigned int ii; + for (ii=0; ii<context.keysize && st.key[ii] == context.key[ii]; ii++); + if (ii==context.keysize) + { + produce( tokenidx, st); + } + } + } + else + { + produce( tokenidx, st); + } + if (tk.core.typeidx != 0) + { + if (tk.core.cnt_end == -1) + { + rt = tk.core.typeidx; + } + else if (tk.core.cnt_end > 0) + { + if (--tokens[ tokenidx].core.cnt_end == 0) + { + tokens[ tokenidx].core.mask.reset(); + } + if (tk.core.cnt_start <= 0) + { + rt = tk.core.typeidx; + } + else + { + --tokens[ tokenidx].core.cnt_start; + } + } + } + } + if (tk.core.mask.rejects( context.type)) + { + //The token must not match anymore after encountering a reject item + tokens[ tokenidx].core.mask.reset(); + } + } + return rt; + } + + /// \brief fetch the next matching element + /// \return type of the matching element + int fetch() + { + int type = 0; + + if (context.scope.mask.matches( context.type)) + { + while (!type) + { + if (context.scope_iter < context.scope.range.tokenidx_to) + { + type = match( context.scope_iter); + ++context.scope_iter; + } + else + { + unsigned int ii = context.scope_iter - context.scope.range.tokenidx_to; + //we match all follows that are not yet been checked in the current scope + if (ii < context.scope.range.followidx && context.scope.range.tokenidx_from > follows[ ii]) + { + type = match( follows[ ii]); + ++context.scope_iter; + } + else if (!triggers.empty()) + { + type = triggers.back(); + triggers.pop_back(); + } + else + { + context.key = 0; + context.keysize = 0; + return 0; //end of all candidates + } + } + } + } + else + { + context.key = 0; + context.keysize = 0; + } + return type; + } + +public: + /// \brief Constructor + /// \param[in] p_atm read only ML path select automaton reference + XMLPathSelect( const ThisXMLPathSelectAutomaton* p_atm) + :atm(p_atm),scopestk(p_atm->maxScopeStackSize),follows(p_atm->maxFollows),triggers(p_atm->maxTriggers),tokens(p_atm->maxTokens) + { + if (atm->states.size() > 0) expand(0); + } + + /// \brief Copy constructor + /// \param [in] o element to copy + XMLPathSelect( const XMLPathSelect& o) + :atm(o.atm),scopestk(o.scopestk),follows(o.follows),triggers(o.triggers),tokens(o.tokens){} + + /// \class iterator + /// \brief input iterator for the output of this XMLScanner + class iterator + { + public: + typedef int value_type; + typedef std::size_t difference_type; + typedef int* pointer; + typedef int& reference; + typedef std::input_iterator_tag iterator_category; + + private: + int element; //< currently visited element (type) + ThisXMLPathSelect* input; //< producing XML path selection stream + + /// \brief Skip to next element + /// \return *this + iterator& skip() throw(exception) + { + if (input != 0) + { + element = input->fetch(); + } + else + { + element = 0; + } + return *this; + } + + /// \brief Iterator compare + /// \param [in] iter iterator to compare with + /// \return true, if the elements are equal + bool compare( const iterator& iter) const + { + return (element == iter.element); + } + + public: + /// \brief Assign iterator + /// \param [in] orig iterator to copy + void assign( const iterator& orig) + { + input = orig.input; + element = orig.element; + } + + /// \brief Copy constructor + /// \param [in] orig iterator to copy + iterator( const iterator& orig) + { + assign( orig); + } + + /// \brief Constructor by values + /// \param [in] p_input XML path selection stream to iterate through + /// \param [in] p_type XML element type to feed to XML path matcher + /// \param [in] p_key XML element value reference to feed to XML path matcher + /// \param [in] p_keysize XML element value size in bytes to feed to XML path matcher + iterator( ThisXMLPathSelect& p_input, XMLScannerBase::ElementType p_type, const char* p_key, int p_keysize) + :input( &p_input) + { + input->initProcessElement( p_type, p_key, p_keysize); + skip(); + } + + /// \brief Default constructor + iterator() + :element(0),input(0) {} + + /// \brief Assignement + /// \param [in] orig iterator to copy + /// \return *this + iterator& operator = (const iterator& orig) + { + assign( orig); + return *this; + } + + /// \brief Element acceess + /// \return read only element reference + int operator*() const + { + return element; + } + + /// \brief Element acceess + /// \return read only element reference + const int* operator->() const + { + return &element; + } + + /// \brief Preincrement + /// \return *this + iterator& operator++() {return skip();} + + /// \brief Postincrement + /// \return *this + iterator operator++(int) {iterator tmp(*this); skip(); return tmp;} + + /// \brief Compare elements for equality + /// \return true, if they are equal + bool operator==( const iterator& iter) const {return compare( iter);} + + /// \brief Compare elements for inequality + /// \return true, if they are not equal + bool operator!=( const iterator& iter) const {return !compare( iter);} + }; + + /// \brief Feed the path selector with the next token and get the start iterator for the results + /// \return iterator pointing to the first of the selected XML path elements + iterator push( XMLScannerBase::ElementType type, const char* key, int keysize) + { + return iterator( *this, type, key, keysize); + } + + /// \brief Feed the path selector with the next token and get the start iterator for the results + /// \return iterator pointing to the first of the selected XML path elements + iterator push( XMLScannerBase::ElementType type, const std::string& key) + { + return iterator( *this, type, key.c_str(), key.size); + } + + /// \brief Get the end of results returned by 'push(XMLScannerBase::ElementType,const char*, int)' + /// \return the end iterator + iterator end() + { + return iterator(); + } +}; + +}//namespace +#endif diff --git a/textwolf/include/textwolf/xmlprinter.hpp b/textwolf/include/textwolf/xmlprinter.hpp new file mode 100644 index 0000000..acf7bd4 --- /dev/null +++ b/textwolf/include/textwolf/xmlprinter.hpp @@ -0,0 +1,387 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this Object refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlprinter.hpp +/// \brief XML printer interface hiding character encoding properties + +#ifndef __TEXTWOLF_XML_PRINTER_HPP__ +#define __TEXTWOLF_XML_PRINTER_HPP__ +#include "textwolf/cstringiterator.hpp" +#include "textwolf/textscanner.hpp" +#include "textwolf/xmlscanner.hpp" +#include "textwolf/charset.hpp" +#include "textwolf/xmltagstack.hpp" +#include <cstring> +#include <cstdlib> + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class XMLPrinter +/// \brief Character encoding dependent XML printer +/// \tparam IOCharset Character set encoding of input and output +/// \tparam AppCharset Character set encoding of the application processor +/// \tparam BufferType STL back insertion sequence to use for printing output +template <class IOCharset, class AppCharset,class BufferType> +class XMLPrinter +{ +private: + /// \brief Prints a character string to an STL back insertion sequence buffer in the IO character set encoding + /// \param [in] src pointer to string to print + /// \param [in] srcsize size of src in bytes + /// \param [out] buf buffer to append result to + void printToBuffer( const char* src, std::size_t srcsize, BufferType& buf) const + { + CStringIterator itr( src, srcsize); + TextScanner<CStringIterator,AppCharset> ts( itr); + + UChar ch; + while ((ch = ts.chr()) != 0) + { + m_output.print( ch, buf); + ++ts; + } + } + + /// \brief print a character substitute or the character itself + /// \param [in] ch character to print + /// \param [in,out] buf buffer to print to + /// \param [in] nof_echr number of elements in echr and estr + /// \param [in] echr ASCII characters to substitute + /// \param [in] estr ASCII strings to substitute with (array parallel to echr) + void printEsc( char ch, BufferType& buf, unsigned int nof_echr, const char* echr, const char** estr) const + { + const char* cc = (const char*)memchr( echr, ch, nof_echr); + if (cc) + { + unsigned int ii = 0; + const char* tt = estr[ cc-echr]; + while (tt[ii]) m_output.print( tt[ii++], buf); + } + else + { + m_output.print( ch, buf); + } + } + + /// \brief print a value with some characters replaced by a string + /// \param [in] src pointer to attribute value string to print + /// \param [in] srcsize size of src in bytes + /// \param [in,out] buf buffer to print to + /// \param [in] nof_echr number of elements in echr and estr + /// \param [in] echr ASCII characters to substitute + /// \param [in] estr ASCII strings to substitute with (array parallel to echr) + void printToBufferSubstChr( const char* src, std::size_t srcsize, BufferType& buf, unsigned int nof_echr, const char* echr, const char** estr) const + { + CStringIterator itr( src, srcsize); + textwolf::TextScanner<CStringIterator,AppCharset> ts( itr); + + textwolf::UChar ch; + while ((ch = ts.chr()) != 0) + { + if (ch < 128) + { + printEsc( (char)ch, buf, nof_echr, echr, estr); + } + else + { + m_output.print( ch, buf); + } + ++ts; + } + } + + /// \brief print attribute value string + /// \param [in] src pointer to attribute value string to print + /// \param [in] srcsize size of src in bytes + /// \param [in,out] buf buffer to print to + void printToBufferAttributeValue( const char* src, std::size_t srcsize, BufferType& buf) const + { + enum {nof_echr = 12}; + static const char* estr[nof_echr] = {"<", ">", "'", """, "&", "�", "", "	", " ", " "}; + static const char echr[nof_echr+1] = "<>'\"&\0\b\t\n\r"; + m_output.print( '"', buf); + printToBufferSubstChr( src, srcsize, buf, nof_echr, echr, estr); + m_output.print( '"', buf); + } + + /// \brief print content value string + /// \param [in] src pointer to content string to print + /// \param [in] srcsize size of src in bytes + /// \param [in,out] buf buffer to print to + void printToBufferContent( const char* src, std::size_t srcsize, BufferType& buf) const + { + enum {nof_echr = 6}; + static const char* estr[nof_echr] = {"<", ">", "&", "�", ""}; + static const char echr[nof_echr+1] = "<>&\0\b"; + printToBufferSubstChr( src, srcsize, buf, nof_echr, echr, estr); + } + + /// \brief Prints a character to an STL back insertion sequence buffer in the IO character set encoding + /// \param [in] ch character to print + /// \param [in,out] buf buffer to print to + void printToBuffer( char ch, BufferType& buf) const + { + m_output.print( (textwolf::UChar)(unsigned char)ch, buf); + } + +public: + /// \brief Default constructor + XMLPrinter() + :m_state(Init){} + + /// \brief Constructor + explicit XMLPrinter( const IOCharset& output_) + :m_state(Init),m_output(output_){} + + /// \brief Copy constructor + XMLPrinter( const XMLPrinter& o) + :m_state(o.m_state),m_buf(o.m_buf),m_tagstack(o.m_tagstack),m_output(o.m_output) + {} + + /// \brief Prints an XML header (version "1.0") + /// \param [in] encoding character set encoding name + /// \param [in] standalone standalone attribute ("yes","no" or NULL for undefined) + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printHeader( const char* encoding, const char* standalone, BufferType& buf) + { + if (m_state != Init) + { + m_lasterror = "printing document not starting with xml header"; + return false; + } + std::string enc = encoding?encoding:"UTF-8"; + printToBuffer( "<?xml version=\"1.0\" encoding=\"", 30, buf); + printToBuffer( enc.c_str(), enc.size(), buf); + if (standalone) + { + printToBuffer( "\" standalone=\"", 14, buf); + printToBuffer( standalone, std::strlen(standalone), buf); + printToBuffer( "\"?>\n", 4, buf); + } + else + { + printToBuffer( "\"?>\n", 4, buf); + } + m_state = Content; + return true; + } + + /// \brief Prints an XML <!DOCTYPE ...> declaration + /// \param [in] rootid root element name + /// \param [in] publicid PUBLIC attribute + /// \param [in] systemid SYSTEM attribute + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printDoctype( const char* rootid, const char* publicid, const char* systemid, BufferType& buf) + { + if (rootid) + { + if (publicid) + { + if (!systemid) + { + m_lasterror = "defined DOCTYPE with PUBLIC id but no SYSTEM id"; + return false; + } + printToBuffer( "<!DOCTYPE ", 10, buf); + printToBuffer( rootid, std::strlen( rootid), buf); + printToBuffer( " PUBLIC \"", 9, buf); + printToBuffer( publicid, std::strlen( publicid), buf); + printToBuffer( "\" \"", 3, buf); + printToBuffer( systemid, std::strlen( systemid), buf); + printToBuffer( "\">", 2, buf); + } + else if (systemid) + { + printToBuffer( "<!DOCTYPE ", 10, buf); + printToBuffer( rootid, std::strlen( rootid), buf); + printToBuffer( " SYSTEM \"", 9, buf); + printToBuffer( systemid, std::strlen( systemid), buf); + printToBuffer( "\">", 2, buf); + } + else + { + printToBuffer( "<!DOCTYPE ", 11, buf); + printToBuffer( rootid, std::strlen( rootid), buf); + printToBuffer( ">", 2, buf); + } + } + return true; + } + + /// \brief Close the current tag attribute context opened + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool exitTagContext( BufferType& buf) + { + if (m_state != Content) + { + if (m_state == Init) + { + m_lasterror = "printed xml without root element"; + return false; + } + printToBuffer( '>', buf); + m_state = Content; + } + return true; + } + + /// \brief Print the start of an open tag + /// \param [in] src start of the tag name + /// \param [in] srcsize length of the tag name in bytes + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printOpenTag( const char* src, std::size_t srcsize, BufferType& buf) + { + if (!exitTagContext( buf)) return false; + printToBuffer( '<', buf); + printToBuffer( (const char*)src, srcsize, buf); + + m_tagstack.push( src, srcsize); + m_state = TagElement; + return true; + } + + /// \brief Print the start of an attribute name + /// \param [in] src start of the attribute name + /// \param [in] srcsize length of the attribute name in bytes + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printAttribute( const char* src, std::size_t srcsize, BufferType& buf) + { + if (m_state == TagElement) + { + printToBuffer( ' ', buf); + printToBuffer( (const char*)src, srcsize, buf); + printToBuffer( '=', buf); + m_state = TagAttribute; + return true; + } + return false; + } + + /// \brief Print a content or attribute value depending on context + /// \param [in] src start of the value + /// \param [in] srcsize length of the value in bytes + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printValue( const char* src, std::size_t srcsize, BufferType& buf) + { + if (m_state == TagAttribute) + { + printToBufferAttributeValue( (const char*)src, srcsize, buf); + m_state = TagElement; + } + else + { + if (!exitTagContext( buf)) return false; + printToBufferContent( (const char*)src, srcsize, buf); + } + return true; + } + + /// \brief Print the close of the current tag open + /// \param [out] buf buffer to print to + /// \return true on success, false if failed (check lasterror()) + bool printCloseTag( BufferType& buf) + { + const void* cltag; + std::size_t cltagsize; + + if (!m_tagstack.top( cltag, cltagsize) || !cltagsize) + { + return false; + } + if (m_state == TagElement) + { + printToBuffer( '/', buf); + printToBuffer( '>', buf); + m_state = Content; + } + else if (m_state != Content) + { + return false; + } + else + { + printToBuffer( '<', buf); + printToBuffer( '/', buf); + printToBuffer( (const char*)cltag, cltagsize, buf); + printToBuffer( '>', buf); + } + m_tagstack.pop(); + if (m_tagstack.empty()) + { + printToBuffer( '\n', buf); + } + return true; + } + + /// \brief Internal state + enum State + { + Init, + Content, + TagAttribute, + TagElement + }; + + /// \brief Get the current internal state + /// \return the current state + State state() const + { + return m_state; + } + + /// \brief Get the last error occurred + /// \return the last error string + const char* lasterror() const + { + return m_lasterror.empty()?0:m_lasterror.c_str(); + } + +private: + State m_state; ///< internal state + BufferType m_buf; ///< element output buffer + TagStack m_tagstack; ///< tag name stack of open tags + IOCharset m_output; ///< output character set encoding + std::string m_lasterror; ///< the last error occurred +}; + +} //namespace +#endif diff --git a/textwolf/include/textwolf/xmlscanner.hpp b/textwolf/include/textwolf/xmlscanner.hpp new file mode 100644 index 0000000..9018816 --- /dev/null +++ b/textwolf/include/textwolf/xmlscanner.hpp @@ -0,0 +1,1355 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmlscanner.hpp +/// \brief XML parser iterator interface for processing the XML elements one by one + +#ifndef __TEXTWOLF_XML_SCANNER_HPP__ +#define __TEXTWOLF_XML_SCANNER_HPP__ +#include "textwolf/char.hpp" +#include "textwolf/charset_interface.hpp" +#include "textwolf/exception.hpp" +#include "textwolf/textscanner.hpp" +#include "textwolf/traits.hpp" +#include <map> +#include <cstddef> + +namespace textwolf { + +/// \class ScannerStatemachine +/// \brief Class to build up the XML element scanner state machine in a descriptive way +class ScannerStatemachine :public throws_exception +{ +public: + enum + { + MaxNofStates=64 ///< maximum number of states (fixed allocated array for state machine) + }; + /// \class Element + /// \brief One state in the state machine + struct Element + { + int fallbackState; ///< state transition if the event does not match (it belongs to the next state = fallbackState) + int missError; ///< error code in case of an event that does not match and there is no fallback + + /// \class Action + /// \brief Definition of action fired by the state machine + struct Action + { + int op; ///< action operand + int arg; ///< action argument + }; + Action action; ///< action executed after entering this state + unsigned char nofnext; ///< number of follow states defined + signed char next[ NofControlCharacter]; ///< follow state fired by an event (control character type parsed) + + /// \brief Constructor + Element() :fallbackState(-1),missError(-1),nofnext(0) + { + action.op = -1; + action.arg = 0; + for (unsigned int ii=0; ii<NofControlCharacter; ii++) next[ii] = -1; + } + }; + /// \brief Get state addressed by its index + /// \param [in] stateIdx index of the state + /// \return state defintion reference + Element* get( int stateIdx) throw(exception) + { + if ((unsigned int)stateIdx>size) throw exception(InvalidState); + return tab + stateIdx; + } + +private: + Element tab[ MaxNofStates]; ///< states of the STM + unsigned int size; ///< number of states defined in the STM + + /// \brief Create a new state + /// \param [in] stateIdx index of the state (must be the size of the STM array, so that state identifiers can be named by enumeration constants for better readability) + void newState( int stateIdx) throw(exception) + { + if (size != (unsigned int)stateIdx) throw exception( StateNumbersNotAscending); + if (size >= MaxNofStates) throw exception( DimOutOfRange); + size++; + } + + /// \brief Define a transition for all control character types not firing yet in the last state defined + /// \param [in] nextState the follow state index defined for these transitions + void addOtherTransition( int nextState) throw(exception) + { + if (size == 0) throw exception( InvalidState); + if (nextState < 0 || nextState > MaxNofStates) throw exception( InvalidParamState); + for (unsigned int inputchr=0; inputchr<NofControlCharacter; inputchr++) + { + if (tab[ size-1].next[ inputchr] == -1) tab[ size-1].next[ inputchr] = (unsigned char)nextState; + } + tab[ size-1].nofnext = NofControlCharacter; + } + + /// \brief Define a transition for inputchr in the last state defined + /// \param [in] inputchr the firing input control character type + /// \param [in] nextState the follow state index defined for this transition + void addTransition( ControlCharacter inputchr, int nextState) throw(exception) + { + if (size == 0) throw exception( InvalidState); + if ((int)inputchr >= (int)NofControlCharacter) throw exception( InvalidParamChar); + if (nextState < 0 || nextState > MaxNofStates) throw exception( InvalidParamState); + if (tab[ size-1].next[ inputchr] != -1) throw exception( DuplicateStateTransition); + tab[ size-1].next[ inputchr] = (unsigned char)nextState; + tab[ size-1].nofnext += 1; + } + + /// \brief Define a self directing transition for inputchr in the last state defined (the state remains the same for this input) + /// \param [in] inputchr the firing input control character type + void addTransition( ControlCharacter inputchr) throw(exception) + { + addTransition( inputchr, size-1); + } + + /// \brief Define an action in the last state defined (to be executed when entering the state) + /// \param [in] action_op action operand + /// \param [in] action_arg action argument + void addAction( int action_op, int action_arg=0) throw(exception) + { + if (size == 0) throw exception( InvalidState); + if (tab[ size-1].action.op != -1) throw exception( InvalidState); + tab[ size-1].action.op = action_op; + tab[ size-1].action.arg = action_arg; + } + + /// \brief Define an error in the last state defined to be reported when no fallback is defined and no firing input character parsed + /// \param [in] error code to be reported + void addMiss( int error) throw(exception) + { + if (size == 0) throw exception( InvalidState); + if (tab[ size-1].missError != -1) throw exception( InvalidState); + tab[ size-1].missError = error; + } + + /// \brief Define in the last state defined a fallback state transition that is fired when no firing input character parsed + /// \param [in] stateIdx follow state index + void addFallback( int stateIdx) throw(exception) + { + if (size == 0) throw exception( InvalidState); + if (tab[ size-1].fallbackState != -1) throw exception( InvalidState); + if (stateIdx < 0 || stateIdx > MaxNofStates) throw exception( InvalidParamState); + tab[ size-1].fallbackState = stateIdx; + } +public: + /// \brief Constructor + ScannerStatemachine() :size(0){} + + /// \brief See ScannerStatemachine::newState(int) + ScannerStatemachine& operator[]( int stateIdx) {newState(stateIdx); return *this;} + /// \brief See ScannerStatemachine::addTransition(ControlCharacter,int) + ScannerStatemachine& operator()( ControlCharacter inputchr, int ns) {addTransition(inputchr,ns); return *this;} + /// \brief See ScannerStatemachine::addTransition(ControlCharacter,int) + ScannerStatemachine& operator()( ControlCharacter i1, ControlCharacter i2, int ns) {addTransition(i1,ns); addTransition(i2,ns); return *this;} + /// \brief See ScannerStatemachine::addTransition(ControlCharacter,int) + ScannerStatemachine& operator()( ControlCharacter i1, ControlCharacter i2, ControlCharacter i3, int ns) {addTransition(i1,ns); addTransition(i2,ns); addTransition(i3,ns); return *this;} + /// \brief See ScannerStatemachine::addTransition(ControlCharacter) + ScannerStatemachine& operator()( ControlCharacter inputchr) {addTransition(inputchr); return *this;} + /// \brief See ScannerStatemachine::addAction(int,int) + ScannerStatemachine& action( int aa, int arg=0) {addAction(aa,arg); return *this;} + /// \brief See ScannerStatemachine::addMiss(int) + ScannerStatemachine& miss( int ee) {addMiss(ee); return *this;} + /// \brief See ScannerStatemachine::addFallback(int) + ScannerStatemachine& fallback( int stateIdx) {addFallback(stateIdx); return *this;} + /// \brief See ScannerStatemachine::addOtherTransition(int) + ScannerStatemachine& other( int stateIdx) {addOtherTransition(stateIdx); return *this;} +}; + +/// \class XMLScannerBase +/// \brief XML scanner base class for things common for all XML scanners +class XMLScannerBase +{ +public: + /// \enum ElementType + /// \brief Enumeration of XML element types returned by an XML scanner + enum ElementType + { + None, ///< empty (NULL) + ErrorOccurred, ///< XML scanning error error reported + HeaderStart, ///< open XML header tag + HeaderAttribName, ///< tag attribute name in the XML header + HeaderAttribValue, ///< tag attribute value in the XML header + HeaderEnd, ///< end of XML header event (after parsing '?>') + DocAttribValue, ///< document attribute value in a DOCTYPE or ENTITY definition + DocAttribEnd, ///< end of a document attribute definition <! .. !> + TagAttribName, ///< tag attribute name (e.g. "id" in <person id='5'> + TagAttribValue, ///< tag attribute value (e.g. "5" in <person id='5'> + OpenTag, ///< open tag (e.g. "bla" for "<bla...") + CloseTag, ///< close tag (e.g. "bla" for "</bla>") + CloseTagIm, ///< immediate close tag (e.g. "bla" for "<bla />") + Content, ///< content element string (separated by spaces or end of line) + Exit ///< end of document + }; + enum + { + NofElementTypes=Exit+1 ///< number of XML element types defined + }; + + /// \brief Get the XML element type as string + /// \param [in] ee XML element type + /// \return XML element type as string + static const char* getElementTypeName( ElementType ee) + { + static const char* names[ NofElementTypes] = {"None","ErrorOccurred","HeaderStart","HeaderAttribName","HeaderAttribValue","HeaderEnd", "DocAttribValue", "DocAttribEnd", "TagAttribName","TagAttribValue","OpenTag","CloseTag","CloseTagIm","Content","Exit"}; + return names[ (unsigned int)ee]; + } + + /// \enum Error + /// \brief Enumeration of XML scanner error codes + enum Error + { + Ok, ///< no error, everything is OK + ErrIllegalDocumentAttributeDef, ///< error in document attribute or entity definition + ErrExpectedOpenTag, ///< expected an open tag in this state + ErrExpectedXMLTag, ///< expected an <?xml tag in this state + ErrUnexpectedEndOfText, ///< unexpected end of text in the middle of the XML definition + ErrSyntaxToken, ///< a specific string expected as token in XML but does not match + ErrStringNotTerminated, ///< attribute string in XML not terminated on the same line + ErrUndefinedCharacterEntity, ///< named entity is not defined in the entity map + ErrExpectedTagEnd, ///< expected end of tag + ErrExpectedEqual, ///< expected equal in tag attribute definition + ErrExpectedTagAttribute, ///< expected tag attribute + ErrExpectedCDATATag, ///< expected CDATA tag definition + ErrInternal, ///< internal error (textwolf implementation error) + ErrUnexpectedEndOfInput, ///< unexpected end of input stream + ErrExpectedEndOfLine, ///< expected mandatory end of line (after XML header) + ErrExpectedDash2 ///< expected second '-' after '<!-' to start an XML comment as '<!-- ... -->' + }; + + /// \brief Get the error code as string + /// \param [in] ee error code + /// \return the error code as string + static const char* getErrorString( Error ee) + { + enum {NofErrors=16}; + static const char* sError[NofErrors] + = {0,"illegal document attribute definition", + "expected open tag", + "expected XML tag", + "unexpected end of text", + "syntax token", + "string not terminated", + "undefined character entity", + "expected tag end", + "expected equal", + "expected tag attribute", + "expected CDATA tag", + "internal", + "unexpected end of input", + "expected end of line", + "expected 2nd '-' to complete marker for start of comment '<!--'" + }; + return sError[(unsigned int)ee]; + } + + /// \enum STMState + /// \brief Enumeration of states of the XML scanner state machine + enum STMState + { + START, STARTTAG, XTAG, PITAG, PITAGEND, XTAGEND, XTAGDONE, XTAGAISK, XTAGANAM, XTAGAESK, XTAGAVSK, XTAGAVID, XTAGAVSQ, XTAGAVDQ, XTAGAVQE, + DOCSTART, CONTENT, TOKEN, SEEKTOK, XMLTAG, OPENTAG, CLOSETAG, TAGCLSK, TAGAISK, TAGANAM, TAGAESK, TAGAVSK, TAGAVID, TAGAVSQ, TAGAVDQ, TAGAVQE, + TAGCLIM, ENTITYSL, ENTITY, ENTITYE, ENTITYID, ENTITYSQ, ENTITYDQ, ENTITYLC, + COMDASH2, COMSEEKE, COMENDD2, COMENDCL, CDATA, CDATA1, CDATA2, CDATA3, EXIT + }; + + /// \brief Get the scanner state machine state as string + /// \param [in] s the state + /// \return the state as string + static const char* getStateString( STMState s) + { + enum Constant {NofStates=48}; + static const char* sState[NofStates] + = { + "START", "STARTTAG", "XTAG", "PITAG", "PITAGEND", + "XTAGEND", "XTAGDONE", "XTAGAISK", "XTAGANAM", + "XTAGAESK", "XTAGAVSK", "XTAGAVID", "XTAGAVSQ", "XTAGAVDQ", + "XTAGAVQE", "DOCSTART", "CONTENT", "TOKEN", "SEEKTOK", "XMLTAG", + "OPENTAG", "CLOSETAG", "TAGCLSK", "TAGAISK", "TAGANAM", + "TAGAESK", "TAGAVSK", "TAGAVID", "TAGAVSQ", "TAGAVDQ", + "TAGAVQE", "TAGCLIM", "ENTITYSL", "ENTITY", "ENTITYE", + "ENTITYID", "ENTITYSQ", "ENTITYDQ", "ENTITYLC", + "COMDASH2", "COMSEEKE", "COMENDD2", "COMENDCL", + "CDATA", "CDATA1", "CDATA2", "CDATA3", "EXIT" + }; + return sState[(unsigned int)s]; + } + + /// \enum STMAction + /// \brief Enumeration of actions in the XML scanner state machine + enum STMAction + { + Return, ReturnWord, ReturnContent, ReturnIdentifier, ReturnSQString, ReturnDQString, ExpectIdentifierXML, ExpectIdentifierCDATA, ReturnEOF, + NofSTMActions = 9 + }; + + /// \brief Get the scanner state machine action as string + /// \param [in] a the action + /// \return the action as string + static const char* getActionString( STMAction a) + { + static const char* name[ NofSTMActions] = {"Return", "ReturnWord", "ReturnContent", "ReturnIdentifier", "ReturnSQString", "ReturnDQString", "ExpectIdentifierXML", "ExpectIdentifierCDATA", "ReturnEOF"}; + return name[ (unsigned int)a]; + }; + + /// \class Statemachine + /// \brief XML scanner state machine implementation + struct Statemachine :public ScannerStatemachine + { + /// \brief Constructor (defines the state machine completely) + Statemachine() + { + (*this) + [ START ](EndOfText,EXIT)(EndOfLine)(Cntrl)(Space)(Lt,STARTTAG).miss(ErrExpectedOpenTag) + [ STARTTAG ](EndOfLine)(Cntrl)(Space)(Questm,XTAG)(Exclam,ENTITYSL).fallback(OPENTAG) + [ XTAG ].action(ExpectIdentifierXML)(EndOfLine,Cntrl,Space,XTAGAISK)(Questm,XTAGEND).miss(ErrExpectedXMLTag) + [ PITAG ](Questm,PITAGEND).other(PITAG) + [ PITAGEND ](Gt,CONTENT).miss(ErrExpectedTagEnd) + [ XTAGEND ](Gt,XTAGDONE)(EndOfLine)(Cntrl)(Space).miss(ErrExpectedTagEnd) + [ XTAGDONE ].action(Return,HeaderEnd).fallback(DOCSTART) + [ XTAGAISK ](EndOfLine)(Cntrl)(Space)(Questm,XTAGEND).fallback(XTAGANAM) + [ XTAGANAM ].action(ReturnIdentifier,HeaderAttribName)(EndOfLine,Cntrl,Space,XTAGAESK)(Equal,XTAGAVSK).miss(ErrExpectedEqual) + [ XTAGAESK ](EndOfLine)(Cntrl)(Space)(Equal,XTAGAVSK).miss(ErrExpectedEqual) + [ XTAGAVSK ](EndOfLine)(Cntrl)(Space)(Sq,XTAGAVSQ)(Dq,XTAGAVDQ).fallback(XTAGAVID) + [ XTAGAVID ].action(ReturnIdentifier,HeaderAttribValue)(EndOfLine,Cntrl,Space,XTAGAISK)(Questm,XTAGEND).miss(ErrExpectedTagAttribute) + [ XTAGAVSQ ].action(ReturnSQString,HeaderAttribValue)(Sq,XTAGAVQE).miss(ErrStringNotTerminated) + [ XTAGAVDQ ].action(ReturnDQString,HeaderAttribValue)(Dq,XTAGAVQE).miss(ErrStringNotTerminated) + [ XTAGAVQE ](EndOfLine,Cntrl,Space,XTAGAISK)(Questm,XTAGEND).miss(ErrExpectedTagAttribute) + [ DOCSTART ](EndOfText,EXIT)(EndOfLine)(Cntrl)(Space)(Lt,XMLTAG).fallback(TOKEN) + [ CONTENT ](EndOfText,EXIT)(Lt,XMLTAG).fallback(TOKEN) + [ TOKEN ].action(ReturnContent,Content)(EndOfText,EXIT)(EndOfLine,Cntrl,Space,CONTENT)(Lt,XMLTAG).fallback(CONTENT) + [ SEEKTOK ](EndOfText,EXIT)(EndOfLine)(Cntrl)(Space)(Lt,XMLTAG).fallback(TOKEN) + [ XMLTAG ](EndOfLine)(Cntrl)(Space)(Questm,PITAG)(Exclam,ENTITYSL)(Slash,CLOSETAG).fallback(OPENTAG) + [ OPENTAG ].action(ReturnIdentifier,OpenTag)(EndOfLine,Cntrl,Space,TAGAISK)(Slash,TAGCLIM)(Gt,CONTENT).miss(ErrExpectedTagAttribute) + [ CLOSETAG ].action(ReturnIdentifier,CloseTag)(EndOfLine,Cntrl,Space,TAGCLSK)(Gt,CONTENT).miss(ErrExpectedTagEnd) + [ TAGCLSK ](EndOfLine)(Cntrl)(Space)(Gt,CONTENT).miss(ErrExpectedTagEnd) + [ TAGAISK ](EndOfLine)(Cntrl)(Space)(Gt,CONTENT)(Slash,TAGCLIM).fallback(TAGANAM) + [ TAGANAM ].action(ReturnIdentifier,TagAttribName)(EndOfLine,Cntrl,Space,TAGAESK)(Equal,TAGAVSK).miss(ErrExpectedEqual) + [ TAGAESK ](EndOfLine)(Cntrl)(Space)(Equal,TAGAVSK).miss(ErrExpectedEqual) + [ TAGAVSK ](EndOfLine)(Cntrl)(Space)(Sq,TAGAVSQ)(Dq,TAGAVDQ).fallback(TAGAVID) + [ TAGAVID ].action(ReturnIdentifier,TagAttribValue)(EndOfLine,Cntrl,Space,TAGAISK)(Slash,TAGCLIM)(Gt,CONTENT).miss(ErrExpectedTagAttribute) + [ TAGAVSQ ].action(ReturnSQString,TagAttribValue)(Sq,TAGAVQE).miss(ErrStringNotTerminated) + [ TAGAVDQ ].action(ReturnDQString,TagAttribValue)(Dq,TAGAVQE).miss(ErrStringNotTerminated) + [ TAGAVQE ](EndOfLine,Cntrl,Space,TAGAISK)(Slash,TAGCLIM)(Gt,CONTENT).miss(ErrExpectedTagAttribute) + [ TAGCLIM ].action(Return,CloseTagIm)(EndOfLine)(Cntrl)(Space)(Gt,CONTENT).miss(ErrExpectedTagEnd) + [ ENTITYSL ](Osb,CDATA)(Dash,COMDASH2).fallback(ENTITY) + [ ENTITY ](Gt,ENTITYE)(EndOfLine)(Cntrl)(Space)(Dq,ENTITYDQ)(Sq,ENTITYSQ)(Osb,ENTITYLC).fallback(ENTITYID) + [ ENTITYE ].action(Return,DocAttribEnd).fallback(SEEKTOK) + [ ENTITYID ].action(ReturnIdentifier,DocAttribValue)(EndOfLine,Cntrl,Space,ENTITY)(Gt,ENTITYE).miss(ErrIllegalDocumentAttributeDef) + [ ENTITYSQ ].action(ReturnSQString,DocAttribValue)(Sq,ENTITY).miss(ErrStringNotTerminated) + [ ENTITYDQ ].action(ReturnDQString,DocAttribValue)(Dq,ENTITY).miss(ErrStringNotTerminated) + [ ENTITYLC ](Csb,ENTITY).other( ENTITYLC) + [ COMDASH2 ](Dash,COMSEEKE).miss(ErrExpectedDash2) + [ COMSEEKE ](Dash,COMENDD2).other(COMSEEKE) + [ COMENDD2 ](Dash,COMENDCL).other(COMSEEKE) + [ COMENDCL ](Gt,SEEKTOK)(Dash,COMENDD2).other(COMSEEKE) + [ CDATA ].action(ExpectIdentifierCDATA)(Osb,CDATA1).miss(ErrExpectedCDATATag) + [ CDATA1 ](Csb,CDATA2).other(CDATA1) + [ CDATA2 ](Csb,CDATA3).other(CDATA1) + [ CDATA3 ](Gt,CONTENT).other(CDATA1) + [ EXIT ].action(Return,Exit); + } + }; + + /// \typedef IsTokenCharMap + /// \brief Forms a set of characters by assigning (true/false) to the whole domain + typedef CharMap<bool,false,NofControlCharacter> IsTokenCharMap; + + /// \class IsTagCharMap + /// \brief Defines the set of tag characters + struct IsTagCharMap :public IsTokenCharMap + { + IsTagCharMap() + { + (*this)(Undef,true)(Any,true)(Dash,true); + } + }; + + /// \class IsWordCharMap + /// \brief Defines the set of content word characters (for tokenization) + /// \deprecated automatic tokenization with whitespace separators option not provided anymore + struct IsWordCharMap :public IsTokenCharMap + { + IsWordCharMap() + { + (*this)(Undef,true)(Equal,true)(Gt,true)(Slash,true)(Dash,true)(Exclam,true)(Questm,true)(Sq,true)(Dq,true)(Osb,true)(Csb,true)(Any,true); + } + }; + + /// \class IsContentCharMap + /// \brief Defines the set of content token characters + struct IsContentCharMap :public IsTokenCharMap + { + IsContentCharMap() + { + (*this)(Cntrl,true)(Space,true)(EndOfLine,true)(Undef,true)(Equal,true)(Gt,true)(Slash,true)(Dash,true)(Exclam,true)(Questm,true)(Sq,true)(Dq,true)(Osb,true)(Csb,true)(Any,true); + } + }; + + /// \class IsSQStringCharMap + /// \brief Defines the set characters belonging to a single quoted string + struct IsSQStringCharMap :public IsContentCharMap + { + IsSQStringCharMap() + { + (*this)(Sq,false)(Space,true); + } + }; + + /// \class IsDQStringCharMap + /// \brief Defines the set characters belonging to a double quoted string + struct IsDQStringCharMap :public IsContentCharMap + { + IsDQStringCharMap() + { + (*this)(Dq,false)(Space,true); + } + }; +}; + + +/// \class XMLScanner +/// \brief XML scanner template that adds the functionality to the statemachine base definition +/// \tparam InputIterator input iterator with ++ and read only * returning 0 als last character of the input +/// \tparam InputCharSet_ character set encoding of the input, read as stream of bytes +/// \tparam OutputCharSet_ character set encoding of the output, printed as string of the item type of the character set, +/// \tparam OutputBuffer_ buffer for output with STL back insertion sequence interface (e.g. std::string,std::vector<char>,textwolf::StaticBuffer) +template +< + class InputIterator, + class InputCharSet_, + class OutputCharSet_, + class OutputBuffer_ +> +class XMLScanner :public XMLScannerBase +{ +private: + /// \class TokState + /// \brief Token state variables + struct TokState + { + /// \enum Id + /// \brief Enumeration of token parser states. + /// \remark These states define where the scanner has to continue parsing when it was interrupted by an EoD exception and reentered again with more input to process. + enum Id + { + Start, ///< start state (no parsing action performed at the moment) + ParsingDone, ///< scanner war interrupted after parsing something when accessing the follow character + ParsingKey, ///< scanner was interrupted when parsing a key + ParsingEntity, ///< scanner was interrupted when parsing an XML character entity + ParsingNumericEntity, ///< scanner was interrupted when parsing an XML numeric character entity + ParsingNumericBaseEntity, ///< scanner was interrupted when parsing an XML basic character entity (apos,amp,etc..) + ParsingNamedEntity, ///< scanner was interrupted when parsing an XML named character entity + ParsingToken ///< scanner was interrupted when parsing a token (not in entity cotext) + }; + Id id; ///< the scanner token parser state + + enum EolnState ///< end of line state to fulfill the W3C requirements for end of line mapping (see http://www.w3.org/TR/xml/: 2.11 End-of-Line Handling) + { + SRC,CR + }; + EolnState eolnState; ///< the scanner end of line state + + unsigned int pos; ///< entity buffer position (buf) + unsigned int base; ///< numeric entity base (10 for decimal/16 for hexadecimal) + EChar value; ///< parsed entity value + char buf[ 16]; ///< parsed entity buffer + UChar curchr_saved; ///< save current character parsed for the case we cannot print it (output buffer too small) + + /// \brief Constructor + TokState() :id(Start),eolnState(SRC),pos(0),base(0),value(0),curchr_saved(0) {} + + /// \brief Reset this state variables (after succesful exit with a new token parsed) + /// \param [in] id_ the new entity parse state + /// \param [in] eolnState_ the end of line mapping state + void init(Id id_=Start, EolnState eolnState_=SRC) + { + id=id_;eolnState=eolnState_;pos=0;base=0;value=0;curchr_saved=0; + } + }; + TokState tokstate; ///< the entity parsing state of this XML scanner + +public: + typedef InputCharSet_ InputCharSet; + typedef OutputCharSet_ OutputCharSet; + class iterator; + +public: + typedef TextScanner<InputIterator,InputCharSet_> InputReader; + typedef XMLScanner<InputIterator,InputCharSet_,OutputCharSet_,OutputBuffer_> ThisXMLScanner; + typedef std::map<const char*,UChar> EntityMap; + typedef OutputBuffer_ OutputBuffer; + +private: + /// \brief Print a character to the output token buffer + /// \param [in] ch unicode character to print + void push( UChar ch) + { + m_output.print( ch, m_outputBuf); + } + + void copychar_impl( const traits::TypeCheck::YES&) + { + m_src.copychar( m_output, m_outputBuf); + } + + void copychar_impl( const traits::TypeCheck::NO&) + { + push( m_src.chr()); + } + + void copychar() + { + copychar_impl( traits::TypeCheck::is_same<InputCharSet,OutputCharSet>::type()); + } + + /// \brief Map a hexadecimal digit to its value + /// \param [in] ch hexadecimal digit to map to its decimal value + static unsigned char HEX( unsigned char ch) + { + struct HexCharMap :public CharMap<unsigned char, 0xFF> + { + HexCharMap() + { + (*this) + ('0',0) ('1', 1)('2', 2)('3', 3)('4', 4)('5', 5)('6', 6)('7', 7)('8', 8)('9', 9) + ('A',10)('B',11)('C',12)('D',13)('E',14)('F',15)('a',10)('b',11)('c',12)('d',13)('e',14)('f',15); + } + }; + static HexCharMap hexCharMap; + return hexCharMap[ch]; + } + + /// \brief Parse a numeric entity value for a table definition (map it to the target character set) + /// \param [in] ir input reader + /// \return the value of the entity parsed + static UChar parseStaticNumericEntityValue( InputReader& ir) + { + EChar value = 0; + unsigned char ch = ir.ascii(); + unsigned int base; + if (ch != '#') return 0; + ir.skip(); + ch = ir.ascii(); + if (ch == 'x') + { + ir.skip(); + ch = ir.ascii(); + base = 16; + } + else + { + base = 10; + } + while (ch != ';') + { + unsigned char chval = HEX(ch); + if (value >= base) return 0; + value = value * base + chval; + if (value >= 0xFFFFFFFF) return 0; + ir.skip(); + ch = ir.ascii(); + } + return (UChar)value; + } + + /// \brief Print the characters of a sequence that was thought to form an entity but did not + /// \return true on success + void fallbackEntity() + { + switch (tokstate.id) + { + case TokState::Start: + case TokState::ParsingDone: + case TokState::ParsingKey: + case TokState::ParsingToken: + break; + case TokState::ParsingEntity: + push('&'); + break; + case TokState::ParsingNumericEntity: + push('&'); + push('#'); + break; + case TokState::ParsingNumericBaseEntity: + push('&'); + push('#'); + for (unsigned int ii=0; ii<tokstate.pos; ii++) push( tokstate.buf[ii]); + break; + case TokState::ParsingNamedEntity: + push('&'); + for (unsigned int ii=0; ii<tokstate.pos; ii++) push( tokstate.buf[ii]); + break; + } + } + + /// \brief Try to parse an entity (we got '&') + /// \return true on success + bool parseEntity() + { + unsigned char ch; + tokstate.id = TokState::ParsingEntity; + ch = m_src.ascii(); + if (ch == '#') + { + m_src.skip(); + return parseNumericEntity(); + } + else + { + return parseNamedEntity(); + } + } + + /// \brief Try to parse a numeric entity (we got '&#') + /// \return true on success + bool parseNumericEntity() + { + unsigned char ch; + tokstate.id = TokState::ParsingNumericEntity; + ch = m_src.ascii(); + if (ch == 'x') + { + tokstate.base = 16; + m_src.skip(); + return parseNumericBaseEntity(); + } + else + { + tokstate.base = 10; + return parseNumericBaseEntity(); + } + } + + /// \brief Try to parse a numeric entity with known base (we got '&#' and we know the base 10/16 of it) + /// \return true on success + bool parseNumericBaseEntity() + { + unsigned char ch; + tokstate.id = TokState::ParsingNumericBaseEntity; + + while (tokstate.pos < sizeof(tokstate.buf)) + { + ch = m_src.ascii(); + if (ch == ';') + { + if (tokstate.value > 0xFFFFFFFF) + { + tokstate.buf[ tokstate.pos++] = ch; + fallbackEntity(); + return true; + } + push( (UChar)tokstate.value); + tokstate.init( TokState::ParsingToken); + m_src.skip(); + return true; + } + else + { + unsigned char chval = HEX(ch); + if (chval >= tokstate.base) + { + fallbackEntity(); + return true; + } + tokstate.buf[ tokstate.pos++] = ch; + tokstate.value = tokstate.value * tokstate.base + chval; + m_src.skip(); + } + } + fallbackEntity(); + return true; + } + + /// \brief Try to parse a named entity + /// \return true on success + bool parseNamedEntity() + { + unsigned char ch; + tokstate.id = TokState::ParsingNamedEntity; + ch = m_src.ascii(); + while (tokstate.pos < sizeof(tokstate.buf)-1 && ch != ';' && m_src.control() == Any) + { + tokstate.buf[ tokstate.pos] = ch; + m_src.skip(); + tokstate.pos++; + ch = m_src.ascii(); + } + if (ch == ';') + { + tokstate.buf[ tokstate.pos] = '\0'; + if (!pushEntity( tokstate.buf)) return false; + tokstate.init( TokState::ParsingToken); + m_src.skip(); + return true; + } + else + { + fallbackEntity(); + return true; + } + } + + /// \brief Try to recover from an interrupted token parsing state (end of input exception) + /// \return true on success + bool parseTokenRecover() + { + bool rt = false; + if (tokstate.curchr_saved) + { + push( tokstate.curchr_saved); + tokstate.curchr_saved = 0; + } + switch (tokstate.id) + { + case TokState::Start: + case TokState::ParsingDone: + case TokState::ParsingKey: + case TokState::ParsingToken: + error = ErrInternal; + return false; + case TokState::ParsingEntity: rt = parseEntity(); break; + case TokState::ParsingNumericEntity: rt = parseNumericEntity(); break; + case TokState::ParsingNumericBaseEntity: rt = parseNumericBaseEntity(); break; + case TokState::ParsingNamedEntity: rt = parseNamedEntity(); break; + } + tokstate.init( TokState::ParsingToken); + return rt; + } + + /// \brief Parse a token defined by the set of valid token characters + /// \param [in] isTok set of valid token characters + /// \return true on success + bool parseToken( const IsTokenCharMap& isTok) + { + if (tokstate.id == TokState::Start) + { + tokstate.id = TokState::ParsingToken; + m_outputBuf.clear(); + } + else if (tokstate.id != TokState::ParsingToken) + { + if (!parseTokenRecover()) + { + tokstate.init(); + return false; + } + } + for (;;) + { + /// \todo When source and dest encoding are equal, then do not decode + /// the value in parsing and encode it when printing. Use some sort + /// of enable_if do redirect to a simple buffer copy. + ControlCharacter ch; + while (isTok[ (unsigned char)(ch=m_src.control())]) + { + unsigned char aa = m_src.ascii(); + if (aa <= 0xD) + { + //handling W3C requirements for end of line translation in XML: + if (aa == '\r') + { + push( (unsigned char)'\n'); + tokstate.eolnState = TokState::CR; + } + else if (aa == '\n') + { + if (tokstate.eolnState != TokState::CR) + { + push( (unsigned char)'\n'); + } + tokstate.eolnState = TokState::SRC; + } + else + { + copychar(); + tokstate.eolnState = TokState::SRC; + } + } + else + { + copychar(); + tokstate.eolnState = TokState::SRC; + } + m_src.skip(); + } + if (ch == Amp) + { + m_src.skip(); + if (!parseEntity()) break; + tokstate.init( TokState::ParsingToken); + continue; + } + else + { + tokstate.init( TokState::ParsingDone); + return true; + } + } + tokstate.init(); + return false; + } + +public: + /// \brief Static version of parse a token for parsing table definition elements + /// \tparam OutputBufferType type buffer for output + /// \param [in] isTok set of valid token characters + /// \param [in] ir input reader iterator + /// \param [out] buf buffer where to write the result to + /// \return true on success + template <class OutputBufferType> + static bool parseStaticToken( const IsTokenCharMap& isTok, InputReader ir, OutputBufferType& buf) + { + static OutputCharSet output; + buf.clear(); + for (;;) + { + ControlCharacter ch; + for (;;) + { + UChar pc; + if (isTok[ (unsigned char)(ch=ir.control())]) + { + pc = ir.chr(); + } + else if (ch == Amp) + { + pc = parseStaticNumericEntityValue( ir); + } + else + { + return true; + } + output.print( pc, buf); + ir.skip(); + } + } + } + +private: + /// \brief Skip a token defined by the set of valid token characters (same as parseToken but nothing written to the output buffer) + /// \param [in] isTok set of valid token characters + /// \return true on success + bool skipToken( const IsTokenCharMap& isTok) + { + do + { + ControlCharacter ch; + while (isTok[ (unsigned char)(ch=m_src.control())] || ch == Amp) + { + m_src.skip(); + } + } + while (m_src.control() == Any); + return true; + } + + /// \brief Parse a token that must be the same as a given string + /// \param [in] str string expected + /// \return true on success + bool expectStr( const char* str) + { + bool rt = true; + tokstate.id = TokState::ParsingKey; + for (; str[tokstate.pos] != '\0'; m_src.skip(),tokstate.pos++) + { + if (m_src.ascii() == str[ tokstate.pos]) continue; + ControlCharacter ch = m_src.control(); + if (ch == EndOfText) + { + error = ErrUnexpectedEndOfText; + } + else + { + error = ErrSyntaxToken; + } + rt = false; + break; + } + tokstate.init( TokState::ParsingDone); + return rt; + } + + /// \brief Parse an entity defined by name (predefined) + /// \param [in] str pointer to the buffer with the entity name + /// \return true on success + bool pushPredefinedEntity( const char* str) + { + switch (str[0]) + { + case 'q': + if (str[1] == 'u' && str[2] == 'o' && str[3] == 't' && str[4] == '\0') + { + push( '\"'); + return true; + } + break; + + case 'a': + if (str[1] == 'm') + { + if (str[2] == 'p' && str[3] == '\0') + { + push( '&'); + return true; + } + } + else if (str[1] == 'p') + { + if (str[2] == 'o' && str[3] == 's' && str[4] == '\0') + { + push( '\''); + return true; + } + } + break; + + case 'l': + if (str[1] == 't' && str[2] == '\0') + { + push( '<'); + return true; + } + break; + + case 'g': + if (str[1] == 't' && str[2] == '\0') + { + push( '>'); + return true; + } + break; + + case 'n': + if (str[1] == 'b' && str[2] == 's' && str[3] == 'p' && str[4] == '\0') + { + push( ' '); + return true; + } + break; + } + return false; + } + + /// \brief Parse an entity defined by name (predefined or in defined in entity table) + /// \param [in] str pointer to the buffer with the entity name + /// \return true on success + bool pushEntity( const char* str) + { + if (pushPredefinedEntity( str)) + { + return true; + } + else if (m_entityMap) + { + EntityMap::const_iterator itr = m_entityMap->find( str); + if (itr == m_entityMap->end()) + { + error = ErrUndefinedCharacterEntity; + return false; + } + else + { + UChar ch = itr->second; + push( ch); + return true; + } + } + else + { + error = ErrUndefinedCharacterEntity; + return false; + } + } + +private: + STMState state; ///< current state of the XML scanner + Error error; ///< last error code + InputReader m_src; ///< source input iterator + const EntityMap* m_entityMap; ///< map with entities defined by the caller + OutputBuffer m_outputBuf; ///< buffer to use for output + OutputCharSet m_output; + +public: + /// \brief Constructor + /// \param [in] p_src source iterator + /// \param [in] p_entityMap read only map of named entities defined by the user + XMLScanner( const InputIterator& p_src, const EntityMap& p_entityMap) + :state(START),error(Ok),m_src(InputCharSet(),p_src),m_entityMap(&p_entityMap),m_output(OutputCharSet()) + {} + /// \brief Constructor + /// \param [in] p_src source iterator + XMLScanner( const InputIterator& p_src) + :state(START),error(Ok),m_src(InputCharSet(),p_src),m_entityMap(0),m_output(OutputCharSet()) + {} + /// \brief Constructor + /// \param [in] p_charset character set encoding of input in case of non default settings (code page) needed + /// \param [in] p_src source iterator + /// \param [in] p_entityMap read only map of named entities defined by the user + XMLScanner( const InputCharSet& p_charset, const InputIterator& p_src, const EntityMap& p_entityMap) + :state(START),error(Ok),m_src(p_charset,p_src),m_entityMap(&p_entityMap),m_output(OutputCharSet()) + {} + /// \brief Constructor + /// \param [in] p_charset character set encoding of input in case of non default settings (code page) needed + /// \param [in] p_src source iterator + XMLScanner( const InputCharSet& p_charset, const InputIterator& p_src) + :state(START),error(Ok),m_src(p_charset,p_src),m_entityMap(0),m_output(OutputCharSet()) + {} + /// \brief Constructor + /// \param [in] p_charset character set encoding of input in case of non default settings (code page) needed + XMLScanner( const InputCharSet& p_charset) + :state(START),error(Ok),m_src(p_charset),m_entityMap(0) + {} + /// \brief Default constructor + XMLScanner() + :state(START),error(Ok),m_src(InputCharSet()),m_entityMap(0) + {} + + /// \brief Copy constructor + /// \param [in] o scanner to copy + XMLScanner( const XMLScanner& o) + :state(o.state) + ,error(o.error) + ,m_src(o.m_src) + ,m_entityMap(o.m_entityMap) + ,m_outputBuf(o.m_outputBuf) + {} + + /// \brief Assign something to the source iterator while keeping the state + /// \param [in] a source iterator assignment + template <class IteratorAssignment> + void setSource( const IteratorAssignment& a) + { + m_src.setSource( a); + } + + /// \brief Get the current source iterator position + /// \return source iterator position in character words (usually bytes) + std::size_t getPosition() const + { + return m_src.getPosition(); + } + + /// \brief Get the current parsed XML element pointer, if it was not masked out, see nextItem(unsigned short) + /// \return the item string + const char* getItemPtr() const {return m_outputBuf.size()?&m_outputBuf.at(0):"\0\0\0\0";} + + /// \brief Get the size of the current parsed XML element in bytes + /// \return the item string + std::size_t getItemSize() const {return m_outputBuf.size();} + + /// \brief Get the current parsed XML element, if it was not masked out, see nextItem(unsigned short) + /// \return the item string + const OutputBuffer& getItem() const + { + return m_outputBuf; + } + + /// \brief Get the current XML scanner state machine state + /// \return pointer to the state variables + ScannerStatemachine::Element* getState() + { + static Statemachine stm; + return stm.get( state); + } + + /// \brief Get the last error + /// \param [out] str the error as string + /// \return the error code + Error getError( const char** str=0) + { + Error rt = error; + error = Ok; + if (str) *str=getErrorString(rt); + return rt; + } + + /// \brief Scan the next XML element + /// \param [in] mask element types that should be printed to the output buffer (1 -> print, 0 -> mask out, just return the element as event) + /// \return the type of the XML element + ElementType nextItem( unsigned short mask=0xFFFF) + { + static const IsWordCharMap wordC; + static const IsContentCharMap contentC; + static const IsTagCharMap tagC; + static const IsSQStringCharMap sqC; + static const IsDQStringCharMap dqC; + static const IsTokenCharMap* tokenDefs[ NofSTMActions] = {0,&wordC,&contentC,&tagC,&sqC,&dqC,0,0,0}; + static const char* stringDefs[ NofSTMActions] = {0,0,0,0,0,0,"xml","CDATA",0}; + + ElementType rt = None; + ControlCharacter ch; + do + { + ScannerStatemachine::Element* sd = getState(); + if (sd->action.op != -1) + { + if (tokenDefs[sd->action.op]) + { + if (tokstate.id != TokState::ParsingDone) + { + if ((mask&(1<<sd->action.arg)) != 0) + { + if (!parseToken( *tokenDefs[ sd->action.op])) return ErrorOccurred; + } + else + { + if (!skipToken( *tokenDefs[ sd->action.op])) return ErrorOccurred; + } + } + rt = (ElementType)sd->action.arg; + } + else if (stringDefs[sd->action.op]) + { + if (tokstate.id != TokState::ParsingDone) + { + if (!expectStr( stringDefs[sd->action.op])) return ErrorOccurred; + if (sd->action.op == ExpectIdentifierXML) + { + //... special treatement for xml header for not + // enforcing the model too much just for this case + push( '?'); push( 'x'); push( 'm'); push( 'l'); + rt = HeaderStart; + } + } + else if (sd->action.op == ExpectIdentifierXML) + { + //... special treatement for xml header for not + // enforcing the model too much just for this case + rt = HeaderStart; + } + } + else + { + m_outputBuf.clear(); + rt = (ElementType)sd->action.arg; + } + if (sd->nofnext == 0) + { + if (sd->fallbackState != -1) + { + state = (STMState)sd->fallbackState; + } + return rt; + } + } + ch = m_src.control(); + tokstate.id = TokState::Start; + + if (sd->next[ ch] != -1) + { + state = (STMState)sd->next[ ch]; + m_src.skip(); + } + else if (sd->fallbackState != -1) + { + state = (STMState)sd->fallbackState; + } + else if (sd->missError != -1) + { + error = (Error)sd->missError; + return ErrorOccurred; + } + else if (ch == EndOfText) + { + error = ErrUnexpectedEndOfText; + return ErrorOccurred; + } + else + { + error = ErrInternal; + return ErrorOccurred; + } + } + while (rt == None); + return rt; + } + + /// \class End + /// \brief end of input tag + struct End {}; + + /// \class iterator + /// \brief input iterator for iterating on the output of an XML scanner + class iterator + { + public: + /// \class Element + /// \brief Iterator element visited + class Element + { + private: + friend class iterator; + ElementType m_type; ///< type of the element + const char* m_content; ///< value string of the element + std::size_t m_size; ///< size of the value string in bytes + public: + /// \brief Type of the current element as string + const char* name() const {return getElementTypeName( m_type);} + /// \brief Type of the current element + ElementType type() const {return m_type;} + /// \brief Value of the current element + const char* content() const {return m_content;} + /// \brief Size of the value of the current element in bytes + std::size_t size() const {return m_size;} + /// \brief Constructor + Element() :m_type(None),m_content(0),m_size(0) {} + /// \brief Constructor + Element( const End&) :m_type(Exit),m_content(0),m_size(0) {} + /// \brief Copy constructor + /// \param [in] orig element to copy + Element( const Element& orig) :m_type(orig.m_type),m_content(orig.m_content),m_size(orig.m_size) {} + }; + // input iterator traits + typedef Element value_type; + typedef std::size_t difference_type; + typedef std::size_t size_type; + typedef Element* pointer; + typedef Element& reference; + typedef std::input_iterator_tag iterator_category; + + private: + Element element; ///< currently visited element + ThisXMLScanner* input; ///< XML scanner + + /// \brief Skip to the next element + /// \param [in] mask element types that should be printed to the output buffer (1 -> print, 0 -> mask out, just return the element as event) + /// \return iterator pointing to the next element + iterator& skip( unsigned short mask=0xFFFF) + { + if (input != 0) + { + element.m_type = input->nextItem(mask); + element.m_content = input->getItemPtr(); + element.m_size = input->getItemSize(); + } + return *this; + } + + /// \brief Compare iterator with another + /// \param [in] iter iterator to compare with + /// \return true if they are equal + bool compare( const iterator& iter) const + { + if (element.type() == iter.element.type()) + { + if (element.type() == Exit || element.type() == None) return true; //equal only at beginning and end + } + return false; + } + public: + /// \brief Assign an iterator to another + /// \param [in] orig iterator to copy + void assign( const iterator& orig) + { + input = orig.input; + element = orig.element; + } + /// \brief Copy constructor + /// \param [in] orig iterator to copy + iterator( const iterator& orig) + { + assign( orig); + } + /// \brief Constructor + /// \param [in] p_input XML scanner to use for iteration + /// \param [in] doSkipToFirst true, if the iterator should skip to the first character of the input (default behaviour of STL conform iterators but maybe not exception save) + iterator( ThisXMLScanner& p_input, bool doSkipToFirst=true) + :input( &p_input) + { + if (doSkipToFirst) + { + element.m_type = input->nextItem(); + element.m_content = input->getItemPtr(); + element.m_size = input->getItemSize(); + } + } + /// \brief Constructor + iterator( const End& et) :element(et),input(0) {} + /// \brief Constructor + iterator() :input(0) {} + /// \brief Assignement operator + /// \param [in] orig iterator to assign to this + iterator& operator = (const iterator& orig) + { + assign( orig); + return *this; + } + /// \brief Element dereference operator + const Element& operator*() const + { + return element; + } + /// \brief Element dereference operator + const Element* operator->() const + { + return &element; + } + /// \brief Preincrement + /// \return *this + iterator& operator++() {return skip();} + /// \brief Postincrement + /// \return *this + iterator operator++(int) {iterator tmp(*this); skip(); return tmp;} + + /// \brief Compare to check for equality + /// \return true, if equal + bool operator==( const iterator& iter) const {return compare( iter);} + /// \brief Compare to check for unequality + /// \return true, if not equal + bool operator!=( const iterator& iter) const {return !compare( iter);} + }; + + /// \brief Get begin iterator + /// \return iterator + /// \param [in] doSkipToFirst true, if the iterator should skip to the first character of the input (default behaviour of STL conform iterators but maybe not exception save) + iterator begin( bool doSkipToFirst=true) + { + return iterator( *this, doSkipToFirst); + } + /// \brief Get the pointer to the end of content + /// \return iterator + iterator end() + { + return iterator( End()); + } +}; + +}//namespace +#endif + + diff --git a/textwolf/include/textwolf/xmltagstack.hpp b/textwolf/include/textwolf/xmltagstack.hpp new file mode 100644 index 0000000..a4671fe --- /dev/null +++ b/textwolf/include/textwolf/xmltagstack.hpp @@ -0,0 +1,146 @@ +/* +--------------------------------------------------------------------- + The template library textwolf implements an input iterator on + a set of XML path expressions without backward references on an + STL conforming input iterator as source. It does no buffering + or read ahead and is dedicated for stream processing of XML + for a small set of XML queries. + Stream processing in this context refers to processing the + document without buffering anything but the current result token + processed with its tag hierarchy information. + + Copyright (C) 2010,2011,2012,2013,2014 Patrick Frey + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 3.0 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +-------------------------------------------------------------------- + + The latest version of textwolf can be found at 'http://github.com/patrickfrey/textwolf' + For documentation see 'http://patrickfrey.github.com/textwolf' + +-------------------------------------------------------------------- +*/ +/// \file textwolf/xmltagstack.hpp +/// \brief textwolf XML printer tag stack + +#ifndef __TEXTWOLF_XML_TAG_STACK_HPP__ +#define __TEXTWOLF_XML_TAG_STACK_HPP__ +#include <cstring> +#include <cstdlib> + +/// \namespace textwolf +/// \brief Toplevel namespace of the library +namespace textwolf { + +/// \class TagStack +/// \brief stack of tag names +class TagStack +{ +public: + /// \brief Destructor + ~TagStack() + { + if (m_ptr) std::free( m_ptr); + } + + /// \brief Default constructor + TagStack() + :m_ptr(0),m_pos(0),m_size(InitSize) + { + if ((m_ptr=(char*)std::malloc( m_size)) == 0) throw std::bad_alloc(); + } + /// \brief Copy constructor + TagStack( const TagStack& o) + :m_ptr(0),m_pos(o.m_pos),m_size(o.m_size) + { + if ((m_ptr=(char*)std::malloc( m_size)) == 0) throw std::bad_alloc(); + std::memcpy( m_ptr, o.m_ptr, m_pos); + } + + /// \brief Push a tag on top + /// \param[out] pp pointer to tag value to push + /// \param[out] nn size of tag value to push in bytes + void push( const char* pp, std::size_t nn) + { + std::size_t align = getAlign( nn); + std::size_t ofs = nn + align + sizeof( std::size_t); + if (m_pos + ofs > m_size) + { + while (m_pos + ofs > m_size) m_size *= 2; + if (m_pos + ofs > m_size) throw std::bad_alloc(); + if (nn > ofs) throw std::logic_error( "invalid tag offset"); + char* xx = (char*)std::realloc( m_ptr, m_size); + if (!xx) throw std::bad_alloc(); + m_ptr = xx; + } + std::memcpy( m_ptr + m_pos, pp, nn); + m_pos += ofs; + void* tt = m_ptr + m_pos - sizeof( std::size_t); + *(std::size_t*)(tt) = nn; + } + + /// \brief Get the topmost tag + /// \param[out] element pointer to topmost tag value + /// \param[out] elementsize size of topmost tag value in bytes + /// \return true on success, false if the stack is empty + bool top( const void*& element, std::size_t& elementsize) + { + std::size_t ofs = topofs(elementsize); + if (!ofs) return false; + element = m_ptr + m_pos - ofs; + return true; + } + + /// \brief Pop (remove) the topmost tag + void pop() + { + std::size_t elementsize=0; + std::size_t ofs = topofs(elementsize); + if (m_pos < ofs) throw std::runtime_error( "corrupt tag stack"); + m_pos -= ofs; + } + + /// \brief Find out if the stack is empty + /// \return true if yes + bool empty() const + { + return (m_pos == 0); + } + +private: + std::size_t topofs( std::size_t& elementsize) + { + if (m_pos < sizeof( std::size_t)) return false; + void* tt = m_ptr + (m_pos - sizeof( std::size_t)); + elementsize = *(std::size_t*)(tt); + std::size_t align = getAlign( elementsize); + std::size_t ofs = elementsize + align + sizeof( std::size_t); + if (ofs > m_pos) return 0; + return ofs; + } +private: + enum {InitSize=256}; + char* m_ptr; + std::size_t m_pos; ///< current position in the tag hierarchy stack buffer + std::size_t m_size; ///< current position in the tag hierarchy stack buffer + + static std::size_t getAlign( std::size_t n) + { + return (sizeof(std::size_t) - (n & (sizeof(std::size_t)-1))) & (sizeof(std::size_t)-1); + } +}; + +} //namespace +#endif diff --git a/textwolf/license.txt b/textwolf/license.txt new file mode 100644 index 0000000..65c5ca8 --- /dev/null +++ b/textwolf/license.txt @@ -0,0 +1,165 @@ + GNU LESSER GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/> + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + + This version of the GNU Lesser General Public License incorporates +the terms and conditions of version 3 of the GNU General Public +License, supplemented by the additional permissions listed below. + + 0. Additional Definitions. + + As used herein, "this License" refers to version 3 of the GNU Lesser +General Public License, and the "GNU GPL" refers to version 3 of the GNU +General Public License. + + "The Library" refers to a covered work governed by this License, +other than an Application or a Combined Work as defined below. + + An "Application" is any work that makes use of an interface provided +by the Library, but which is not otherwise based on the Library. +Defining a subclass of a class defined by the Library is deemed a mode +of using an interface provided by the Library. + + A "Combined Work" is a work produced by combining or linking an +Application with the Library. The particular version of the Library +with which the Combined Work was made is also called the "Linked +Version". + + The "Minimal Corresponding Source" for a Combined Work means the +Corresponding Source for the Combined Work, excluding any source code +for portions of the Combined Work that, considered in isolation, are +based on the Application, and not on the Linked Version. + + The "Corresponding Application Code" for a Combined Work means the +object code and/or source code for the Application, including any data +and utility programs needed for reproducing the Combined Work from the +Application, but excluding the System Libraries of the Combined Work. + + 1. Exception to Section 3 of the GNU GPL. + + You may convey a covered work under sections 3 and 4 of this License +without being bound by section 3 of the GNU GPL. + + 2. Conveying Modified Versions. + + If you modify a copy of the Library, and, in your modifications, a +facility refers to a function or data to be supplied by an Application +that uses the facility (other than as an argument passed when the +facility is invoked), then you may convey a copy of the modified +version: + + a) under this License, provided that you make a good faith effort to + ensure that, in the event an Application does not supply the + function or data, the facility still operates, and performs + whatever part of its purpose remains meaningful, or + + b) under the GNU GPL, with none of the additional permissions of + this License applicable to that copy. + + 3. Object Code Incorporating Material from Library Header Files. + + The object code form of an Application may incorporate material from +a header file that is part of the Library. You may convey such object +code under terms of your choice, provided that, if the incorporated +material is not limited to numerical parameters, data structure +layouts and accessors, or small macros, inline functions and templates +(ten or fewer lines in length), you do both of the following: + + a) Give prominent notice with each copy of the object code that the + Library is used in it and that the Library and its use are + covered by this License. + + b) Accompany the object code with a copy of the GNU GPL and this license + document. + + 4. Combined Works. + + You may convey a Combined Work under terms of your choice that, +taken together, effectively do not restrict modification of the +portions of the Library contained in the Combined Work and reverse +engineering for debugging such modifications, if you also do each of +the following: + + a) Give prominent notice with each copy of the Combined Work that + the Library is used in it and that the Library and its use are + covered by this License. + + b) Accompany the Combined Work with a copy of the GNU GPL and this license + document. + + c) For a Combined Work that displays copyright notices during + execution, include the copyright notice for the Library among + these notices, as well as a reference directing the user to the + copies of the GNU GPL and this license document. + + d) Do one of the following: + + 0) Convey the Minimal Corresponding Source under the terms of this + License, and the Corresponding Application Code in a form + suitable for, and under terms that permit, the user to + recombine or relink the Application with a modified version of + the Linked Version to produce a modified Combined Work, in the + manner specified by section 6 of the GNU GPL for conveying + Corresponding Source. + + 1) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (a) uses at run time + a copy of the Library already present on the user's computer + system, and (b) will operate properly with a modified version + of the Library that is interface-compatible with the Linked + Version. + + e) Provide Installation Information, but only if you would otherwise + be required to provide such information under section 6 of the + GNU GPL, and only to the extent that such information is + necessary to install and execute a modified version of the + Combined Work produced by recombining or relinking the + Application with a modified version of the Linked Version. (If + you use option 4d0, the Installation Information must accompany + the Minimal Corresponding Source and Corresponding Application + Code. If you use option 4d1, you must provide the Installation + Information in the manner specified by section 6 of the GNU GPL + for conveying Corresponding Source.) + + 5. Combined Libraries. + + You may place library facilities that are a work based on the +Library side by side in a single library together with other library +facilities that are not Applications and are not covered by this +License, and convey such a combined library under terms of your +choice, if you do both of the following: + + a) Accompany the combined library with a copy of the same work based + on the Library, uncombined with any other library facilities, + conveyed under the terms of this License. + + b) Give prominent notice with the combined library that part of it + is a work based on the Library, and explaining where to find the + accompanying uncombined form of the same work. + + 6. Revised Versions of the GNU Lesser General Public License. + + The Free Software Foundation may publish revised and/or new versions +of the GNU Lesser General Public License from time to time. Such new +versions will be similar in spirit to the present version, but may +differ in detail to address new problems or concerns. + + Each version is given a distinguishing version number. If the +Library as you received it specifies that a certain numbered version +of the GNU Lesser General Public License "or any later version" +applies to it, you have the option of following the terms and +conditions either of that published version or of any later version +published by the Free Software Foundation. If the Library as you +received it does not specify a version number of the GNU Lesser +General Public License, you may choose any version of the GNU Lesser +General Public License ever published by the Free Software Foundation. + + If the Library as you received it specifies that a proxy can decide +whether future versions of the GNU Lesser General Public License shall +apply, that proxy's public statement of acceptance of any version is +permanent authorization for you to choose that version for the +Library. |