当前位置:主页 > 资料 >

Introduction to the rank-select bit-string
栏目分类:资料   发布日期:2018-08-02   浏览次数:

导读:本文为去找网小编(www.7zhao.net)为您推荐的Introduction to the rank-select bit-string,希望对您有所帮助,谢谢! Last week we looked at how traditional whole-document parsers struggle to parse big files. Such parsers

本文为去找网小编(www.7zhao.net)为您推荐的Introduction to the rank-select bit-string,希望对您有所帮助,谢谢!

欢迎访问www.7zhao.net



Last week we looked at how traditional whole-document parsers struggle to parse big files.

欢迎访问www.7zhao.net

Such parsers both too use much memory and are too slow, being orders of magnitude slower than what IO bandwidth would allow.

欢迎访问www.7zhao.net

In some sense, we can understand the slowness as a consequence large memory usage: All that memory access does not come for free.

去找(www.7zhao.net欢迎您

Traditional whole-document parsers spend a lot of time allocating memory, assigning pointers, following indirections and touching new memory that isn’t cached in the CPU cache where it could have been accessed much more quickly. www.7zhao.net

To achieve high performance, the parser needs to do as little possible, but traditional parsers are actually creating a lot of work for the hardware, that is incidental to solving the problem of making the document data accessible and much of that overhead is hidden from us, the developer, by language and hardware abstractions, so the overhead is easy to overlook.

内容来自www.7zhao.net

We have seen how objects are severely expensive, especially when allocated en-mass. For our use-case their cost disproportionately exceeds their utility. copyright www.7zhao.net

If not objects then what

If we want to minimise memory usage, the first thing you should do is avoid duplicating data.

本文来自去找www.7zhao.net

All the data we want to access is already in the document. Copying that data into intermediate objects so that we can work with them is wasteful. 欢迎访问www.7zhao.net

Instead we want to reuse the data in its original form as much as possible and leave the parsing of small individual elements of the document to the last moment just before we use them and in that way avoid parsing all the parts of the document we don’t need. 内容来自www.7zhao.net

What prevents us from doing that exactly that is that we lack the means to navigate the structure of the document to locate the various pieces of data in the document we are interested in.

欢迎访问www.7zhao.net

This structure is exactly what our object-based document object model provided, but to avoid paying their dues, we must find another way. 欢迎访问www.7zhao.net

But if not objects, then what? copyright www.7zhao.net

If you want to find something extremely small and light weight, you could do worse than choose the humble bit. It is 64 times smaller than a pointer and many multiples of that smaller than an object.

内容来自www.7zhao.net

So let’s do exactly that: Let’s use bits!

copyright www.7zhao.net

The rank-select bit-string

Before we can pull this off we are going to have to learn us some concepts.

去找(www.7zhao.net欢迎您

Imagine a string of bits. Not unlike the following: 本文来自去找www.7zhao.net

 

copyright www.7zhao.net

In order to query this bit-string we will be using two very powerful query operations rank and select .

copyright www.7zhao.net

The pseudocode for these to operations are provided below in Haskell which you can drop into your Haskell repl to observe their behaviours:

copyright www.7zhao.net










 欢迎访问www.7zhao.net 

The first function popCount1 is the population operation (sometimes called the ). It tells us how many 1 bits there are in our bit-string.

欢迎访问www.7zhao.net

 www.7zhao.net 

The second function rank1 is the rank operation which tells us the population count of the length n prefix of our bit-string.

本文来自去找www.7zhao.net

Here are some example rank queries: 欢迎访问www.7zhao.net









 

欢迎访问www.7zhao.net

The third function select1 is the select operation which tells us smallest prefix of the given bit-string with the given population n . 去找(www.7zhao.net欢迎您

Here are some example select queries: 欢迎访问www.7zhao.net









 欢迎访问www.7zhao.net 

In less precise terms, the rank gives us how many 1s up to a given position n in our bit-string and select gives us the position of the n th 1 in our bit-string. 去找(www.7zhao.net欢迎您

Rank-Select Bit-String as an index into JSON

We will now use the rank-select bit-string to index into JSON, which is to say we will use it to locate interesting locations in our JSON document that correspond to the beginning of JSON nodes. www.7zhao.net

In our semi-index, every bit in the rank-select bit-string corresponds to a byte in the original document of the same position. The value of each bit is chosen such that when the byte represents the start of a node in the document it will be set to 1 . Otherwise it will be set to 0 .

去找(www.7zhao.net欢迎您

For JSON, the beginning of every object (indicated by { ), every array (indiciated by [ ), every field or value will be marked with a 1 , and all other bytes are marked with a 0 .

欢迎访问www.7zhao.net

The example below demonstrates this mapping from bytes to bits:

本文来自去找www.7zhao.net

 

copyright www.7zhao.net

What this gives us is the ability to locate the n th node in the document with rank-select operations. 欢迎访问www.7zhao.net

For example the 6 th node is marked by the 6 th 1 bit at the location that corresponds to the field-name "car" : 本文来自去找www.7zhao.net





 去找(www.7zhao.net欢迎您 

In another example, the 9 th node is marked by the 9 th 1 bit at the location that corresponds to the beginning of the JSON array:

内容来自www.7zhao.net





 
copyright www.7zhao.net

You may notice that each successive 1 bit in the rank-select bit-string identifies nodes of the document according to pre-order traversal. 本文来自去找www.7zhao.net

It is also worth noting that the rank-select bit-string in combination with the original text can be used to identify the type of the node in O(1) time, by testing the character pointed to by the rank-select bit-string.

欢迎访问www.7zhao.net

For example, we know the 6 th node in the document is a string because the character at line 31 is a " , whilst the 9 th node is an array because the character at line 52 is a [ . 欢迎访问www.7zhao.net

Rank-Select Bit-String as an index into CSV

Rank-select bit-strings can be used to index into a CSV document in a similar fashion, but I have chosen to use two rank-select bit-strings instead of one.

www.7zhao.net

Take the following CSV document

copyright www.7zhao.net

"name","age","profession"
John,30,Code Monkey
Kyle,40,Data Scrubber 

去找(www.7zhao.net欢迎您

I will represent this document on a single line for easier comparison with the two rank-select bit-strings that follow:

本文来自去找www.7zhao.net

"name","age","profession"␤John,30,Code Monkey␤Kyle,40,Data Scrubber
0000001000001000000000000100001001000000000001000010010000000000000
0000000000000000000000000100000000000000000001000000000000000000000 欢迎访问www.7zhao.net 

The first rank-select bit-string marks both delimiters and newlines and helps us find the beginning of fields, whilst the other rank-select bit-string marks newlines only and helps us find the beginning of rows.

去找(www.7zhao.net欢迎您

Having two rank-select bit-strings affords us operations like row count, and field count (either per document or per row) without having to inspect the original document text at all. 欢迎访问www.7zhao.net

If you’re interested with playing with the haskell-works library which works on real bit-vectors (rather than emulation on strings), head over to the project page.

copyright www.7zhao.net

Other ingredients to a fast JSON or CSV parser

A rank-select bit-string by itself won’t get us a fast JSON or CSV parser.

去找(www.7zhao.net欢迎您

We will need high performance rank and select operations for both short and very long bit-strings to make this practical.

www.7zhao.net

We haven’t talked about the implications this has for whole document or streaming parsers either. 去找(www.7zhao.net欢迎您

Nor have we discussed how to build rank-select bit-strings efficiently and then use them to parse the document in a way that is competitive with traditional parsers. 本文来自去找www.7zhao.net

Furthermore, the rank-select bit-string is insufficient for parsing heirarchical document formats such as JSON because it does not allow for tree traversal. www.7zhao.net

Whilst we know the nodes are ordered in pre-order traversal order, we have lost information about parent-child or sibling-sibling relationships between nodes, and so are unable traverse in such a way as to, for example, construct a tree of nodes that resemble the document.

去找(www.7zhao.net欢迎您

An additional index is required to make proper tree-traversal of the document possible. 欢迎访问www.7zhao.net

Alas, these and other questions must remain for a future post.

去找(www.7zhao.net欢迎您

References

copyright www.7zhao.net

www.7zhao.net


本文原文地址:https://haskell-works.github.io/posts/2018-08-01-introduction-to-rank-select-bit-string.html

以上为Introduction to the rank-select bit-string文章的全部内容,若您也有好的文章,欢迎与我们分享!

www.7zhao.net

Copyright ©2008-2017去找网版权所有   皖ICP备12002049号-2 皖公网安备 34088102000435号   关于我们|联系我们| 免责声明|友情链接|网站地图|手机版