Monday, June 29, 2009

Statistic module - 4th and 5th week update

First of all, let me apologize for not updating the blog a week ago. I know there are no excuses - I just kept putting this task off until the whole week had passed.

To begin with, I will describe my update from two weeks ago.

First of all, I added support for some more sophisticated statistics. The question is how to define which statistics belong to this group. Let us first take a look at a simple statistic that was already being counted correctly, for example Students Per Degree. The collection process for such a statistic looks like this: we have a static list of choices for degrees ('undergraduate', 'master', 'PhD') and we iterate through all student entities (in batches, of course, because of Google App Engine limitations); for each entity we check the degree field and increase the value in the appropriate bucket. Unfortunately, for some statistics we do not know the list of choices before starting the collection. Take a statistic like Applications Per Student. In this case we could also use static choices (the numbers from 1 through 20, because every student may submit up to 20 applications) and gather the data by iterating through all student entities and, for each student, iterating through all student proposal entities to check how many of them belong to that student. However, for large inputs this would be at least as inefficient as bubble sort: its complexity is O(proposals * students).
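To make the static-choices case concrete, here is a minimal sketch of such a collection in Python; the Student model, its degree field, and BATCH_SIZE are hypothetical stand-ins for Melange's actual code:

```python
# Minimal sketch: collecting a static-choices statistic (Students Per Degree).
# 'Student', its 'degree' field and BATCH_SIZE are hypothetical stand-ins.
from google.appengine.ext import db

BATCH_SIZE = 100
CHOICES = ['undergraduate', 'master', 'PhD']

class Student(db.Model):
    degree = db.StringProperty(choices=CHOICES)

def collect_students_per_degree():
    counts = dict((choice, 0) for choice in CHOICES)
    query = db.Query(Student).order('__key__')
    batch = query.fetch(BATCH_SIZE)
    while batch:
        for student in batch:
            counts[student.degree] += 1  # bump the appropriate bucket
        # fetch the next batch, keyed after the last processed entity
        query = db.Query(Student).filter(
            '__key__ >', batch[-1].key()).order('__key__')
        batch = query.fetch(BATCH_SIZE)
    return counts
```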
Of course, there is a better way: use the list of all students as the choices, iterate through the student proposals, and for each proposal increase the number associated with the student in its scope. There is just one problem: we do not have the list of all students at the beginning. However, there is a simple workaround: the list of choices starts out empty and we add students to it dynamically. When we process a single proposal, we check whether the student in its scope is already on the list. If so, we increase his number by one; otherwise we add him and set his number to one. This works quite smoothly. Of course, I know this algorithm is not linear in the number of proposals, because we have to look the student up in a dictionary, so in the worst case it works like O(proposals * log(_students_)), where _students_ is the number of students who submitted at least one proposal. Still, this approach is better than the first one.
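A sketch of this dynamic-choices approach might look as follows, reusing BATCH_SIZE and the Student model from the sketch above; the StudentProposal model and its scope reference are again hypothetical:

```python
# Minimal sketch: dynamic choices (Applications Per Student). A student is
# added to the result the first time one of his proposals is seen.
class StudentProposal(db.Model):
    scope = db.ReferenceProperty(Student)  # the student who submitted it

def collect_applications_per_student():
    counts = {}  # the list of choices starts out empty
    query = db.Query(StudentProposal).order('__key__')
    batch = query.fetch(BATCH_SIZE)
    while batch:
        for proposal in batch:
            # read the key without dereferencing, to avoid extra fetches
            student_key = str(
                StudentProposal.scope.get_value_for_datastore(proposal))
            counts[student_key] = counts.get(student_key, 0) + 1
        query = db.Query(StudentProposal).filter(
            '__key__ >', batch[-1].key()).order('__key__')
        batch = query.fetch(BATCH_SIZE)
    return counts
```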
Anyway, it still has a disadvantage: we get no information about the students who did not submit any proposals, and there is certainly a bunch of them. For other similar statistics this problem is not so substantial. For example, for Applications Per Organization we may assume that each organization receives at least one proposal; for Students Per Program we may also assume that someone registers for each program (at least the ones hosted by Google :-), and so on. Nonetheless, Mario and I decided that all statistics should be fully covered.
Therefore, the statistics for which we do not know the full list of choices will be collected in the following way: first we collect the full list of choices (also in batches, because their number can be large) and only then do we collect the actual data. Consider Applications Per Student again: first we iterate through all student entities, and then, having the full list of students, we iterate through the proposals and match each proposal with its student.
This approach has not been merged into our repository yet. Collecting a statistic this way in batches is rather awkward, because the number of batches automatically increases. Nevertheless, it is ready and can be merged quickly after a conversation with the mentors and/or other developers.
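The two-pass collection could be sketched like this, building on the hypothetical models above; the real code in our repository is organized differently:

```python
# Rough sketch of the two-pass collection: pass one gathers the full list
# of choices, pass two gathers the actual data.
def for_each_in_batches(model, process):
    """Apply 'process' to every entity of 'model', in key-ordered batches."""
    query = db.Query(model).order('__key__')
    batch = query.fetch(BATCH_SIZE)
    while batch:
        for entity in batch:
            process(entity)
        query = db.Query(model).filter(
            '__key__ >', batch[-1].key()).order('__key__')
        batch = query.fetch(BATCH_SIZE)

def collect_applications_per_student_fully():
    counts = {}
    # first pass: every student becomes a choice, so students who submitted
    # no proposals are covered too
    for_each_in_batches(
        Student, lambda student: counts.setdefault(str(student.key()), 0))
    # second pass: match each proposal with its student
    def count_proposal(proposal):
        key = str(StudentProposal.scope.get_value_for_datastore(proposal))
        counts[key] += 1
    for_each_in_batches(StudentProposal, count_proposal)
    return counts
```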

Some other things achieved during that week include creating standard views for statistic entities (create, list, delete) and adding support for a few new statistics.

And last but not least, Mario and I made some important decisions. Firstly, Mario set up an Issue Tracker on our bitbucket repository. Secondly, we decided to organize daily meetings. Thirdly, we postponed the abstraction of statistics until we add support for statistics for surveys.

And now, let me move on to the very last week. First of all, I was sick on Thursday and Friday, so in the end I could not do everything that I had planned.
The most important thing that has been done is that statistics are now collected using the Task Queue API. I tried to use this API wisely and make the code at least a little bit reusable, because it will probably be useful for other problems that Melange is to encounter in the future. I am going to describe my solution on the wiki and on the dev list; it would be great to get some feedback so that I can improve it. The key idea is that we can divide a long job into smaller subtasks and then start the first task. When one task executes and finds that the whole job is not completed, it may repeat the same task or start a new one. Statistic collection is just one example: as we know, statistics are collected in batches, so we start a task which collects the first batch and saves its data, and because the whole collection is not yet completed, it restarts the task. The user only has to start the first task and does not have to click several times.
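Here is a minimal sketch of that chaining pattern, assuming a handler mapped to a hypothetical /tasks/collect_stats URL and a hypothetical process_one_batch helper; in the current SDK the Task Queue API still lives under a 'labs' package:

```python
# Minimal sketch of a self-chaining task. The URL, the handler and the
# process_one_batch helper are hypothetical, not Melange's actual code.
from google.appengine.api.labs import taskqueue
from google.appengine.ext import webapp

class CollectStatsHandler(webapp.RequestHandler):
    def post(self):
        last_key = self.request.get('last_key') or None
        # collect and save one batch; returns the key to resume from,
        # or None when the whole collection is completed
        next_key = process_one_batch(last_key)
        if next_key is not None:
            # not completed yet: restart the same task for the next batch
            taskqueue.add(url='/tasks/collect_stats',
                          params={'last_key': next_key})

application = webapp.WSGIApplication(
    [('/tasks/collect_stats', CollectStatsHandler)])
```

The user only enqueues the first task, for example with taskqueue.add(url='/tasks/collect_stats'), and the chain takes care of the rest.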

This week I also started getting involved in the JavaScript side of the module. Mario is the one who created the whole skeleton, and I had to understand his design. So far I have taken a good look at his code and gone through a jQuery tutorial. Today Mario briefly walked me through the skeleton (thank you for that! :) and I am going to write some code on my own very soon.
