Thursday, December 5, 2013

Python - A hack to process a long list of data faster

I learned a hack from a colleague for processing a long list of data faster. Here is the case:

Suppose I have a list of thousands of user records, and I need to send each user an email and do some other work with them.

1. First, as I loop through the users, I translate each username to a number with:

def str_to_num(input_str):
    return_num = 0
    for ch in input_str:
        return_num += ord(ch)

    return return_num

2. Take that number modulo 10, which always gives a value from 0 to 9:

num = translated_username % 10

3. If that value is one of the condition numbers I pass in when I run the script, execute the main work (email, ...) for that user; otherwise skip the user:

if num in condition_number:
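The three steps above can be sketched together as one small helper. This is only a minimal sketch: `bucket_of` and `should_process` are hypothetical names that are not part of the original script, and the username is made up.

```python
def str_to_num(input_str):
    # Sum the code points of every character in the string
    return sum(ord(ch) for ch in input_str)


def bucket_of(username):
    # Hypothetical helper: map a username to a bucket from 0 to 9
    return str_to_num(username) % 10


def should_process(username, condition_number):
    # Hypothetical helper: True if this run is responsible for the user
    return bucket_of(username) in condition_number


# 'abc' -> 97 + 98 + 99 = 294 -> bucket 4
print(bucket_of('abc'))               # 4
print(should_process('abc', [4, 5]))  # True
print(should_process('abc', [0, 1]))  # False
```

Because the bucket depends only on the characters of the username, every run of the script computes the same bucket for the same user. Python's built-in hash() would not be a safe substitute here, since string hashes are randomized per process by default in Python 3.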

The rest of the script:


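import csv
import sys


def get_dict_data_from_csv_file(csv_file_path):
    # newline='' lets the csv module handle line endings itself
    with open(csv_file_path, newline='') as csv_file:
        # Sniff the delimiter from a sample, then rewind before reading
        sniffdialect = csv.Sniffer().sniff(csv_file.read(1024), delimiters='\t,;')
        csv_file.seek(0)
        dict_reader = csv.DictReader(csv_file, dialect=sniffdialect)
        dict_data = []
        for record in dict_reader:
            dict_data.append(record)

    return dict_data


def main_function(src_data, condition_number):
    for dat in src_data:
        # str_to_num is the function defined in step 1
        num = str_to_num(dat['username']) % 10
        if num in condition_number:
            # Send the email and do the other per-user work here
            pass


if __name__ == '__main__':
    src_data = get_dict_data_from_csv_file(sys.argv[1])

    # Turn an argument such as "0,1" into the list [0, 1]
    condition_number = [int(n) for n in sys.argv[2].split(',')]

    main_function(src_data, condition_number)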

Now I can split the whole list and process the parts simultaneously, because each run only handles the users whose value matches its input condition. The possible values range from 0 to 9:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9
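To see why the split is safe, here is a small check. It assumes the ten values are divided into five groups of two; the `groups` list, the `owners` helper and the sample usernames are made up for illustration. It shows that each user is picked up by exactly one group, so nobody is processed twice or skipped.

```python
def str_to_num(input_str):
    return sum(ord(ch) for ch in input_str)


# Assumed grouping: five runs, two values each
groups = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Made-up usernames, for illustration only
usernames = ['alice', 'bob', 'carol', 'dave', 'erin', 'frank']


def owners(username):
    # Indexes of the groups whose condition matches this user
    num = str_to_num(username) % 10
    return [i for i, group in enumerate(groups) if num in group]


for name in usernames:
    # Every user has exactly one owning group
    assert len(owners(name)) == 1
    print(name, '-> group', owners(name)[0])
```

The guarantee holds for any grouping in which each value from 0 to 9 appears in exactly one group.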

For example, I can run the script in 5 different terminal windows, each with a different group of input conditions:

Terminal #0:

$ python myscript /home/trinh/Documents/data.csv 0,1

Terminal #1:

$ python myscript /home/trinh/Documents/data.csv 2,3

Terminal #2:

$ python myscript /home/trinh/Documents/data.csv 4,5

Terminal #3:

$ python myscript /home/trinh/Documents/data.csv 6,7

Terminal #4:

$ python myscript /home/trinh/Documents/data.csv 8,9

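This means the job can finish up to about 5 times faster than a single run, provided the users are spread fairly evenly across the ten values and the machine (and whatever the script talks to, such as the mail server) can keep up with five processes at once.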
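Instead of opening five terminals by hand, the same split can be driven from one script. The following is a rough sketch using the standard `multiprocessing` module, which is not the approach used in this post: `process_group` and `handle_user` are hypothetical names, the records are made up, and the per-user work is only a placeholder.

```python
from multiprocessing import Pool


def str_to_num(input_str):
    return sum(ord(ch) for ch in input_str)


def handle_user(record):
    # Placeholder for the real per-user work (send email, ...)
    return record['username']


def process_group(args):
    # Handle only the records whose value is in this group's condition
    src_data, condition_number = args
    return [handle_user(dat) for dat in src_data
            if str_to_num(dat['username']) % 10 in condition_number]


if __name__ == '__main__':
    # Made-up records standing in for the CSV rows
    src_data = [{'username': name} for name in ('alice', 'bob', 'carol', 'dave')]
    groups = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

    # One worker process per group, like one terminal per group
    with Pool(processes=len(groups)) as pool:
        results = pool.map(process_group, [(src_data, g) for g in groups])

    handled = [name for group_result in results for name in group_result]
    # Every record is handled exactly once across all groups
    assert sorted(handled) == ['alice', 'bob', 'carol', 'dave']
```

Each worker receives its own copy of the full list and filters it, just as each terminal reads the whole CSV file and skips the users that belong to other groups.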