Monday 20 February 2017

Graphs in Python

In one of the posts from the Python Data Structures series I promised that graphs would join the family, so here they are.
Design: I decided on an adjacency list implementation. The underlying data structures are a list of vertices and a list of lists (the list of adjacent vertices for each vertex). I could have used a symbol table (more operations), but decided to keep things as simple as possible (graph algorithms are complicated on their own). A sample of the code:

# Simple graph API in Python; the implementation uses adjacency lists.
# See also here:
# Classes: Graph, Depth_first_search, Depth_first_paths
# Usage:
# Creating a new graph: gr1 = Graph(v) - creates a new
# graph with no edges and v vertices;

# Search object: gr2 = Depth_first_search(graph, vertex) -
# creates a search object,
# gr2.marked_vertex(vertex) - returns True if the
# given vertex is reachable from the source (above)

# Path object: gr3 = Depth_first_paths(graph, vertex) -
# creates a new path object,
# gr3.has_path(vertex) - the same as marked_vertex above
# gr3.path_to(vertex) - returns the path from the source vertex to the given one
from collections import deque


class Graph:
    """Class Graph, creates a graph described by integers: the number
    of vertices is V and the vertices are v0, v1, ..., v(V-1)"""

    def __init__(self, v_in):
        """constructor - takes the number of vertices and creates a graph
         with no edges (E = 0) and empty adjacency lists for all vertices"""
        self.V = v_in
        self.E = 0
        self.adj = []
        for i in range(v_in):
            self.adj.append([])

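    # The counts live in the attributes self.V and self.E, so the accessor
    # methods need different names: a method called V or E would be shadowed
    # by the instance attribute of the same name and could not be called.
    def num_vertices(self):
        """returns number of vertices"""
        return self.V

    def num_edges(self):
        """returns number of edges"""
        return self.E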

    def add_edge(self, v, w):
        """adds an edge to the graph, takes two integers (two vertices)
         and creates the edge v-w by modifying the appropriate adjacency lists"""
        self.adj[v].append(w)
        self.adj[w].append(v)
        self.E += 1

    def adj_list(self, v):
        """Takes an integer (a graph vertex) and returns its adjacency list"""
        return self.adj[v]

    def __str__(self):
        """to-string method, returns a printable description of the graph"""
        s = str(self.V) + " vertices, " + str(self.E) + " edges\n"
        for v in range(self.V):
            s += str(v) + ": "
            for w in self.adj[v]:
                s += str(w) + " "
            s += "\n"
        return s


class Depth_first_search:
    """class depth first search, creates a search object,
    the constructor takes a graph and a source vertex"""

    def __init__(self, gr_obj, v_obj):
        self.marked = [False] * gr_obj.V
        self.cnt = 0
        self.__dfs(gr_obj, v_obj)

    def __dfs(self, gr, v):
        """private depth first search, proceeds recursively,
        mutates marked - marks all the vertices reachable
         from the given vertex (v); also mutates cnt - the number of visited vertices"""
        self.marked[v] = True
        self.cnt += 1
        for w in gr.adj_list(v):
            if not self.marked[w]:
                self.__dfs(gr, w)

    def marked_vertex(self, w):
        """Takes an integer - a graph vertex and returns True if it's reachable
        from vertex v (source)"""
        return self.marked[w]

    def count(self):
        """returns the number of visited vertices
        (starting from the vertex given in the constructor)"""
        return self.cnt
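
The header comments mention a Depth_first_paths class that is not listed above. Below is a minimal sketch of how such a class could look; it is not the version from the repository. To keep it runnable on its own it takes a plain list of adjacency lists instead of a Graph object, and the names DepthFirstPathsSketch and edge_to are my assumptions.

```python
from collections import deque


class DepthFirstPathsSketch:
    """Hypothetical stand-in for Depth_first_paths.

    adj is a list of adjacency lists, source is the start vertex.
    """

    def __init__(self, adj, source):
        self.source = source
        self.marked = [False] * len(adj)
        self.edge_to = [None] * len(adj)  # edge_to[w] = vertex from which w was reached
        self._dfs(adj, source)

    def _dfs(self, adj, v):
        self.marked[v] = True
        for w in adj[v]:
            if not self.marked[w]:
                self.edge_to[w] = v
                self._dfs(adj, w)

    def has_path(self, v):
        """True if v is reachable from the source."""
        return self.marked[v]

    def path_to(self, v):
        """Path from the source to v as a list, or None if there is none."""
        if not self.has_path(v):
            return None
        path = deque()
        x = v
        while x != self.source:
            path.appendleft(x)  # walk back towards the source
            x = self.edge_to[x]
        path.appendleft(self.source)
        return list(path)


# Undirected graph with edges 0-1, 1-2, 0-3; vertex 4 is isolated.
adj = [[1, 3], [0, 2], [1], [0], []]
paths = DepthFirstPathsSketch(adj, 0)
print(paths.path_to(2))   # [0, 1, 2]
print(paths.has_path(4))  # False
```

The path found is the one the depth first search happened to take, which is not necessarily the shortest one.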

Why did I do it in the Java style (additional classes for the given graph operations)? Primarily to avoid setting and mutating global variables (in this case cnt in the class Depth_first_search; I don't know how to solve it using a function and no object :/). Also, there are lots of graph tasks and operations and, imo, this approach is clearer (did I really write this? :)). There are definitely going to be updates on this! The code is, obviously, also on github. Thanks, till the next time!
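
For what it's worth, one possible way to get the same information from a plain function, without a global counter, is to keep the visited set local and return it. This is only a sketch: the function name reachable_from is made up, it works on a bare list of adjacency lists rather than on the Graph class, and it uses an explicit stack instead of recursion.

```python
def reachable_from(adj, source):
    """Return the set of vertices reachable from source.

    adj is a list of adjacency lists. The visited set is a local
    variable, so nothing global is set or mutated.
    """
    marked = set()
    stack = [source]
    while stack:
        v = stack.pop()
        if v in marked:
            continue
        marked.add(v)
        # push the neighbours that still need a visit
        stack.extend(w for w in adj[v] if w not in marked)
    return marked


# Undirected graph with edges 0-1, 1-2, 0-3; vertex 4 is isolated.
adj = [[1, 3], [0, 2], [1], [0], []]
visited = reachable_from(adj, 0)
print(sorted(visited))  # [0, 1, 2, 3]
print(len(visited))     # 4, the number count() reports in the class version
```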

Thursday 9 February 2017

Bit Hacks in Go

Some time ago I wrote about little bit twiddlings in C; recently I have tried something similar in Go. It's not easy: for example, the shift count of a bit shift (>>) has to be an unsigned integer, and there are other differences. For now I have reproduced and tested two things (both work on unsigned ints):  
integer mean without causing an overflow and exponentiation by binary decomposition.

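// iexp returns x to the power n (wrapping around on uint32 overflow),
// using binary decomposition of the exponent.
func iexp(x uint32, n uint32) uint32 {
	if n == 0 {
		return 1
	}
	var p, y uint32
	y = 1
	p = x
	for {
		if n&1 != 0 {
			y = p * y
		}
		n >>= 1
		if n == 0 {
			return y
		}
		p *= p
	}
}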
The above is about 4 to 5 times faster than math.Pow from the standard library.
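
To see what the binary decomposition does step by step, here is the same loop sketched in Python (used here only as a quick cross-check). Python integers never overflow, so the MASK32 constant is my addition to mimic uint32 wrap-around.

```python
MASK32 = 0xFFFFFFFF  # emulate uint32 wrap-around, which Python ints lack


def iexp(x, n):
    """x**n modulo 2**32 by binary decomposition of the exponent n."""
    y = 1
    p = x & MASK32
    while n:
        if n & 1:                # this bit of n is set: multiply it in
            y = (y * p) & MASK32
        n >>= 1
        p = (p * p) & MASK32     # p runs through x, x^2, x^4, x^8, ...
    return y


print(iexp(3, 13))  # 1594323, i.e. 3^8 * 3^4 * 3^1 because 13 = 0b1101
print(iexp(2, 32))  # 0, wraps around just like uint32 would
```

The loop runs once per bit of n, so it needs about log2(n) squarings instead of n multiplications.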
Average:

func average(x uint32, y uint32) uint32 {
 return (x & y) + ((x ^ y) >> 1)
}

This time the advantage is not speed, but the fact that we get the average without overflow (it works, for example, for 2^32 - 1 and 2^32 - 1).  
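
Why this works: x + y = 2 * (x & y) + (x ^ y), because x & y holds the bits that produce a carry and x ^ y the bits that do not. Halving both sides gives the formula, and the sum x + y is never formed. A small Python sketch of the comparison follows; the MASK32 constant and the naive_average_uint32 helper are my additions to imitate 32-bit wrap-around.

```python
MASK32 = 0xFFFFFFFF  # largest uint32 value, 2**32 - 1


def average(x, y):
    """Floor of (x + y) / 2 without ever forming x + y."""
    return (x & y) + ((x ^ y) >> 1)


def naive_average_uint32(x, y):
    """What (x + y) / 2 gives when the sum wraps around at 32 bits."""
    return ((x + y) & MASK32) >> 1


big = MASK32
print(average(big, big))               # 4294967295
print(naive_average_uint32(big, big))  # 2147483647, the overflowed result
```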
That's it, the code is also on github, till the next time!

Tuesday 7 February 2017

Python Pollard's rho Algorithm

I was recently looking for some number-theoretic algorithms in Python and Go. While searching, I found this in the first position on DuckDuckGo and Google, but it's not really good. Even for the example used (1200), it gives a wrong answer (missing the factor 600).  
Applying a small fix:

from math import gcd
from random import randint


def pollard_rho(n):
    s = set()  # non-trivial divisors found so far
    i = 0
    xi = randint(0, n-1)
    y = xi
    k = 2
    while i < 2 * n:
        i += 1
        # x_i = x_(i-1)^2 - 1 (mod n); square with *, since ^ is XOR in Python
        xi = (xi * xi - 1) % n
        d = gcd(y - xi, n)
        if d != 1 and d != n:
            s.add(d)
        if i == k:
            y = xi
            k *= 2
    return sorted(s)

Here gcd is the Greatest Common Divisor. The fix is the 2 * n bound in the while loop; it's still slow, though.
If anybody needs more efficient factorization (and more), I would recommend this library.
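
For comparison, a common textbook variant of Pollard's rho stops at the first non-trivial factor and uses Floyd's cycle detection (a "tortoise" and a "hare") instead of the doubling checkpoint above. The sketch below is not the code from the post or from the recommended library; the function name, the polynomial x*x + c and the defaults x0=2, c=1 are just the usual illustrative choices.

```python
from math import gcd


def pollard_rho_floyd(n, x0=2, c=1):
    """Return one non-trivial factor of a composite n > 3, or None on failure.

    Uses f(x) = x*x + c (mod n) with Floyd's tortoise-and-hare cycle
    detection. x0 and c are arbitrary starting choices; if the attempt
    fails, retry with different values.
    """
    if n % 2 == 0:
        return 2
    x = y = x0
    d = 1
    while d == 1:
        x = (x * x + c) % n   # tortoise: one step
        y = (y * y + c) % n   # hare: two steps
        y = (y * y + c) % n
        d = gcd(abs(x - y), n)
    return d if d != n else None


print(pollard_rho_floyd(8051))  # 97, since 8051 = 83 * 97
```

To get a full factorization one would call it again on the factor and the cofactor until only primes are left.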

Friday 3 February 2017

Python Inverted Index Algorithm

As a continuation of interesting algorithms, today another text processing tool: the Inverted Index.  
How it works, briefly: let's say that we have a list of documents, maybe a large one (thousands of books, articles or so), and we want to query this list efficiently, looking for information. This is done by saving the corpus on disk (or in a database) and creating a special data structure to search with and then retrieve the matched documents (the database part is not part of this article; I hope I will implement it soon in this project).  
I used Python's dicts and sets; the best way is to go through it step by step.  
I have 2 test documents. After tokenization, we may have:

d1 = ['i', 'did', 'enact', 'julius', 'caesar', 'i', 'was', 'kill' , 'i', 'the', 'capitol',  'brutus','me']
d1 = set(d1)
d2 = ['i', 'so', 'let', 'it', 'be', 'with' , 'caesar', 'the', 'noble', 'brutus', 'hath', 'told', 'you', 
      'caesar', 'was', 'ambitious']
d2 = set(d2)

Now create the list of documents to be searched.


doc_list = []
doc_list.append(d1)
doc_list.append(d2)

For really big corpora, this (doc_list) should be a function that fetches and prepares documents one by one to create the inverted index structure.
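
A rough sketch of that idea, assuming a generator takes the place of doc_list: documents are produced one at a time and the index is filled as they arrive, so the whole corpus never has to sit in memory. The names iter_documents and build_postings are hypothetical, and the two tiny documents stand in for whatever a real loader would read from disk or a database.

```python
from collections import defaultdict


def iter_documents():
    """Hypothetical loader: yields one tokenized document at a time.

    A real version would read files or database rows here instead.
    """
    yield {'i', 'did', 'enact', 'julius', 'caesar'}
    yield {'i', 'so', 'let', 'it', 'be', 'with', 'caesar'}


def build_postings(documents):
    """Map each token to the list of document numbers containing it."""
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(documents):
        for token in tokens:
            postings[token].append(doc_id)
    return dict(postings)


postings = build_postings(iter_documents())
print(postings['caesar'])  # [0, 1]
print(postings['julius'])  # [0]
```

Here the dict is keyed by the token alone, which is a simplification compared with the (token, frequency) keys used in this post.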


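def find(elem, d_list):
    """Return the indexes of the documents containing elem, and how many there are."""
    l_return = []
    for i, x in enumerate(d_list):
        if elem in x:
            l_return.append(i)
    return (l_return, len(l_return))

def find_key(x, d):
    """Return the (token, frequency) key of d that contains the token x."""
    for y in d.keys():
        if x in y:
            return y
    return "word not in the texts"

# The helpers have to be defined before this loop runs.
tokens = d1.union(d2)
inv_index = {}
for x in tokens:
    tmp, tmp1 = find(x, doc_list)
    inv_index[(x, tmp1)] = tmp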

Tokens is the set of all unique words in all the documents. The keys of the inv_index dict are tuples (token, frequency), where the frequency is the number of documents the word is present in, which is the length of the key's value. Values: as I've said, for every key word the value is a list of the documents which contain this word; it's implemented here as a list of integers, zero for the first document, one for the next and so on. The functions find and find_key just help to create and query the dict. Searches in the documents are now boolean queries on our dict values, for example:

('cat' AND 'lion') AND NOT 'mouse'
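
To make the key and value layout described above concrete, here is a small self-contained sketch that rebuilds the index for the two test documents and prints a few entries. It uses a list comprehension instead of the find helper, purely to keep the snippet short.

```python
d1 = {'i', 'did', 'enact', 'julius', 'caesar', 'was', 'kill',
      'the', 'capitol', 'brutus', 'me'}
d2 = {'i', 'so', 'let', 'it', 'be', 'with', 'caesar', 'the', 'noble',
      'brutus', 'hath', 'told', 'you', 'was', 'ambitious'}
doc_list = [d1, d2]

# Same layout as in the post: (token, frequency) -> list of document numbers.
inv_index = {}
for token in d1 | d2:
    docs = [i for i, doc in enumerate(doc_list) if token in doc]
    inv_index[(token, len(docs))] = docs

print(inv_index[('julius', 1)])     # [0]
print(inv_index[('ambitious', 1)])  # [1]
print(inv_index[('caesar', 2)])     # [0, 1]
```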

In fact I'll do it in a slightly different manner: I will parse the query and use my own, modified Propositional Logic Parser. Here are the boolean functions:  


def NOT(l1):
    indexes = l1
    ret_list = []
    for x in range(len(doc_list)):
        if x not in indexes:
            ret_list.append(x)
    return ret_list

def OR(l1, l2):
    tmp0 = set(l1)
    tmp1 = set(l2)
    indexes = tmp0.union(tmp1)
    ret_list = []
    for x in indexes:
        ret_list.append(x)
    return ret_list

def AND(l1, l2):
    tmp0 = set(l1)
    tmp1 = set(l2)
    indexes = tmp0.intersection(tmp1)
    ret_list = []
    for x in indexes:
        ret_list.append(x)
    return ret_list
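
The three functions above are set complement, union and intersection over document numbers. As an illustration only (not the code used in the rest of the post), the same kind of query can be written directly with Python's set operators; the postings helper and the shortened documents below are assumptions made for the example.

```python
doc_list = [
    {'i', 'did', 'enact', 'julius', 'caesar'},  # document 0
    {'i', 'so', 'let', 'it', 'be', 'caesar'},   # document 1
]
all_docs = set(range(len(doc_list)))


def postings(token):
    """Set of document numbers containing token."""
    return {i for i, doc in enumerate(doc_list) if token in doc}


# ('i' AND 'caesar') AND NOT 'julius'
result = (postings('i') & postings('caesar')) & (all_docs - postings('julius'))
print(sorted(result))  # [1]
```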

This is the parser part:


from porpositional_logic_eval import *
import re
def query_tree_evaluate(tree):
    opers = {'||': OR, '&&': AND, '~': NOT} 
    leftT = tree.getLeftChild()
    rightT = tree.getRightChild()
    #pdb.set_trace()
    if leftT and not rightT:
        fn = opers[tree.getRootVal()]
        return fn(query_tree_evaluate(leftT))
    elif leftT and rightT:
        fn = opers[tree.getRootVal()]
        return fn(query_tree_evaluate(leftT), query_tree_evaluate(rightT))
    else:
        return tree.getRootVal()
    
def build_query_parse_tree(exp):
    exp_list = exp.replace('(', ' ( ').replace(')', ' ) ').replace('~', ' ~ ').split()
    e_tree = BinaryTree('')
    current_tree = e_tree
    for token in exp_list:
        if token == '(':
            current_tree.insertLeft('')
            current_tree = current_tree.getLeftChild()
        elif token in ['||','&&', '->', '==', 'XR']:
            if current_tree.getRootVal() == '~':
                current_tree.getParent().setRootVal(token)
                current_tree.insertRight('')
                current_tree = current_tree.getRightChild()
            else:
                current_tree.setRootVal(token)
                current_tree.insertRight('')
                current_tree = current_tree.getRightChild()
        elif token == '~':
            current_tree.setRootVal('~')
            current_tree.insertLeft('')
            current_tree = current_tree.getLeftChild()
        elif token == ')':
            current_tree = current_tree.getParent()
        elif re.search('[a-zA-Z]', token):
            current_tree.setRootVal(inv_index[find_key(token, inv_index)])
            current_tree = current_tree.getParent()
            if current_tree.getRootVal() == '~':
                current_tree = current_tree.getParent()
        else:
            raise ValueError
    return e_tree

Finally, checking how it works. The query is:

exp = "((i && caesar) && ~julius)"

Making a tree:


tr = build_query_parse_tree(exp)
inorder_traversal(tr)
# output:
[0, 1]
&&
[0, 1]
&&
[0]
~

As seen, the parser does the job, and asking it:


query_tree_evaluate(tr)
# output:
[1]

This gives us what we expected: the second document (it contains 'caesar' and 'i', but not 'julius'). Complexity, as seen, is linear, so it should be pretty quick.  
Hopefully there will be updates here soon! The code is, as usual, on github. Thank you!