How to handle compressed and uncompressed streams with Boost::Iostreams


This question came up for me when I had to handle files that could be either compressed or uncompressed, and I needed to do so transparently.

If you look online, there may be only one answer to that, and it is on StackOverflow, where I answered it. Here is some more context on what the answer does and what the problem with Boost::Iostreams is.

Context

The reason why there is no online answer is not very obvious. First, zlib itself can handle compressed and uncompressed streams in C on the fly. So there should be no reason for the Iostreams decompressor to have any problem.

The reason stems from the fact that the decompressor doesn’t delegate the header parsing to zlib, but does it manually. And there is no option for a missing header: it will just break and stop in that case.

So while lots of GNU tools can handle text files or gz-compressed files without a specific option, Boost::Iostreams throws an exception at you, telling you to change your stream stack.

This is not very maintainable. For instance, if you have to open your file first to check whether it is compressed, then build your stack and open the file again, it feels like a lot of work for nothing. And it is. When you stream from cloud storage that charges for each access, doubling the number of requests is not sustainable.

The solution

My solution comes from stealing code from the decompressor itself. At first, I wanted to just read the first two characters and then wrap them in a fixed array that I would read again, either with the decompressor or simply by calling read on the parent stream. Unfortunately, the only object for this in Boost::Iostreams, basic_array_source, doesn’t provide a read interface, and it would have been tough to switch to the main stream afterwards.

I also tried implementing the seekable interface, which was a huge pain. Parent filters and sources cannot be told to seek back (even if they have the capability, like a simple ifstream), and you have to declare your full stack as seekable. This means that your own filter also has to implement the seekable API (which is impossible if you don’t have random access, like in a compressed file!). The problem is that even if it works for files, it will not work for other kinds of streams, such as the Google Storage Client API. That one will silently skip the current buffer and then throw an exception in a parallel thread, aborting your program. Just horrible.

So instead, I reused the peekable_source private class from the decompressor. The latter already had to sometimes read data and put it back into the main stream. It could have sought back, but instead it keeps a small string buffer that it serves from first when data is requested. And this works so well that I wondered why it’s not part of the main API.

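#include <algorithm>
#include <string>

#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/char_traits.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/operations.hpp>

using namespace boost::iostreams;
 
template<typename Source>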
struct PeekableSource {
    typedef char char_type;
    struct category : source_tag, peekable_tag { };
    explicit PeekableSource(Source& src, const std::string& putback = "")
            : src_(src), putback_(putback), offset_(0)
    { }
    std::streamsize read(char* s, std::streamsize n)
    {
        std::streamsize result = 0;
 
        // Copy characters from putback buffer
        std::streamsize pbsize =
                static_cast<std::streamsize>(putback_.size());
        if (offset_ < pbsize) {
            result = (std::min)(n, pbsize - offset_);
            BOOST_IOSTREAMS_CHAR_TRAITS(char)::copy(
                    s, putback_.data() + offset_, result);
            offset_ += result;
            if (result == n)
                return result;
        }
 
        // Read characters from src_
        std::streamsize amt =
                boost::iostreams::read(src_, s + result, n - result);
        return amt != -1 ?
               result + amt :
               result ? result : -1;
    }
    void putback(const std::string& s)
    {
        putback_.replace(0, offset_, s);
        offset_ = 0;
    }
 
    Source&          src_;
    std::string      putback_;
    std::streamsize  offset_;
};

And now we can simply use this to peek at the first two characters of our input stream to see whether they match the gzip magic number (0x1f 0x8b), and then delegate the actual read either to the decompressor or to the parent source:

struct GzDecompressor {
    typedef char              char_type;
    typedef multichar_input_filter_tag  category;
 
    gzip_decompressor m_decompressor;
    bool m_initialized{false};
    bool m_is_compressed{false};
    std::string m_putback;
 
    template<typename Source>
    void init(Source& src) {
        std::string data;
        data.push_back(get(src));
        data.push_back(get(src));
        m_is_compressed = data[0] == static_cast<char>(0x1f) && data[1] == static_cast<char>(0x8b);
        src.putback(data);
        m_initialized = true;
    }
 
    template<typename Source>
    std::streamsize read(Source& src, char* s, std::streamsize n) {
        PeekableSource<Source> peek(src, m_putback);
        if (!m_initialized) {
            init(peek);
        }
 
        if (m_is_compressed) {
            return m_decompressor.read(peek, s, n);
        }
 
        return boost::iostreams::read(peek, s, n);
    }
};

As we still go through the main read calls, this filter is almost transparent to the user and should not have any noticeable impact on performance.

What I deeply regret is that the Iostreams decompressor doesn’t have an option to do this natively.
