Monday 23 January 2017

Text Segmentation (Maximum Matching) in Python

Today another algorithm in the set Algorithms in Python, part one here - maximum matching - it's a text segmentation algorithm - separates word in a  text, with laguages with no clear word separator, like Chinesse. It's simple, that's why works only for short words texts, again, an example is  Chinesse. This is a code:  

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
D = ['danny', 'condo', 'a', 'the','to','has', 'been', 'unable', 'go', 'at']


def max_match(sentence, dictionary):
    if not sentence:
        return ""
    for i in range(len(sentence), -1, -1):
        first_word = sentence[:i]
        remainder = sentence[i:]
        if first_word in dictionary:
            return first_word + " " + max_match(remainder, dictionary)
    first_word = sentence[0]
    remainder = sentence[1:]
    return first_word + max_match(remainder, dictionary)

print(max_match('atcondogo', D))

Let's go through it, dictionary is a simple Python list, algorithm starts at the beginning of a input string and try to match, with the dictionary, the longest word - line 8 (in that case this is a whole string and the reminder is empty), if it succeed returns first word with added space bar separator and recursive call from the remainder; if not, pointer goes one char backwards and  so on in this loop. If no word matches, pointer advances one character (creates one character word), and is recursively applied to the rest of the string - lines 12 - 14.
Line 16 corectrly prints:  

1
"at condo go" 
 But if we add the word 'atcon' to the dictionary, algorithm incorrectly segments to:

1
"atcon dogo"

Not very usefull to English though. That's it, code of course goes to github, together with: Extended Euclicid, modular exponeneation and the longest palindrome finder. Enjoy!

Thursday 19 January 2017

Algorithms in Python

I've decided to start another repo: Python Algorithms, during my programming works I've collected some amount of code which is stacked in different places as a useful functions, algorithms, even Project Euler Helper - designed to only this website. I'm going to slowly check, test that stuff (my goal is, that this is suppose to be solid and ready to work code) and put it on github. Let's bring some algorithms then!  

First edit distance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def edit_dist_dp(str1, str2, m, n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]


            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                   store[i - 1][j - 1])

    return store[m][n]

This is dynamic programming version, naive, recursive is extremely ineffective. What is does is, according to dp paradigm, first it creates an array to store temporary values (line 2, where m and n are input strings lengths accordingly) and then in a loop starts filling it bottom up, i.e. counting minimum edit distance.
If the first string is empty we have only insert all the characters of the second string (lines 7, 8, #j operations). If the second string is empty, than we insert all the first string characters (line 13, 14, time complexity i).
If the last characters are the same we ignore them and move to remaining string(lines 12, 14).  
If the last characters are different, we have to rise distance by at least one plus recursively find minimum between the all possibilities: Insert, remove and replace (lines 17 to 20). Complexity of the whole thing is O(m*n).  

Another is simple, but nice algorithm to shuffle elements of a given array.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
# algorithm for shuffling given subscriptable data structure (Knuth)

import random


def swap(alist, i, j):
    """swaps input lists i, j elements"""
    alist[i], alist[j] = alist[j], alist[i]


def shuffle(data):
    """randomly shuffles element in the input data"""
    n = len(data)
    for token in range(n - 1):
        swap(data, token, random.randrange(token, n))

Maybe it's obvious (in that case sorry for that:)), but I think the reason for that algorithm being published is, that is easy to make a mistake and write the last line like this:

1
swap(data, token, random.randrange(token, n))

Which is definitely no good:)  

 Thats it for now, more stuff on github, incuding clasics merge sort, quicksort and binary search; obvioususly more algorithms will come, stay in tuch!

Friday 13 January 2017

Web Page Crawler in Python

Again, here is another jupyter notebook guest post - this time it's simple recursive web crawler and html parser (using n ary trees). Code also here. Cheers!
PS Python Data Structures updated as well

Tuesday 10 January 2017

Python (Numpy) Multivariable Linear Regression Update!

I've found some bugs and other issues with my previous version of Gradient Descent Algorithm, so I updated, tested(I'm convinced, that this time is OK, feel free to test it folks), and published it hereGithub code changed too!

Saturday 7 January 2017

Tuesday 3 January 2017

Daily Notes

From New Year I've decided to create a github repo and put programming works, notes, etc daily; this maybe a cool idea for learning - possibility to easy come back, make a backup of drafts. It's here.