Monday, July 20, 2009

Statistic Module: eighth week update

Here is the latest update on my work on the Statistic Module.

During the last week I focused almost entirely on the backend side. As I already wrote on the blog, I was working on an abstraction layer for statistics. The goal was to separate statistics from the Python backend code. Until now, each single statistic needed at least one function (in practice a few) to process it. For a request to collect a statistic, the logic looked for a function named 'update' + statistic_link_id. At the beginning this was an easy and rather convenient solution, but as the number of statistics grew, the statistic logic ended up with an awful lot of very similar, short functions. What is more, the worst problem was that every statistic had to be hardcoded in the source code. As James Crook pointed out in one of his emails, this was a huge pitfall, because every time we wanted to add a new statistic, we had to add new code and *redeploy* Melange.
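To make the old lookup concrete, here is a minimal sketch of name-based dispatch. The class name, the return value and the exact link_id format are assumptions made for illustration; this is not the actual Melange code.

```python
class OldStatisticLogic:
    """Hypothetical stand-in for the old, hardcoded statistic logic."""

    def updateStudentsPerCountry(self):
        # In the old design, logic and choices were set up here by hand.
        return "students per country"

    def collect(self, statistic_link_id):
        # Build the handler name as 'update' + link_id and look it up.
        handler = getattr(self, "update" + statistic_link_id, None)
        if handler is None:
            # No such function in the deployed code: adding one means a redeploy.
            raise NotImplementedError("no handler for " + statistic_link_id)
        return handler()


old_logic = OldStatisticLogic()
students_result = old_logic.collect("StudentsPerCountry")
```

Asking this sketch for "MentorsPerCountry" raises NotImplementedError until a new method is written and deployed, which is exactly the pitfall described above.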

Thus, I designed a solution that stores statistic-specific information as a JSON string in the statistic model: a new field, "instructions", was added. I will describe the meaning of all parameters and the dependencies between them in the upcoming days. Generally, some parameters that were previously set by statistic-specific functions are now retrieved from the instructions.
Let us take a look at the following example. We have the "Students Per Country" statistic, so all students are iterated through, and for each of them we check the country and update the choices list.
Before calling the collectStat function, we needed to set up at least two things:
logic: to student_logic (because we iterate students)
choices: to soc.models.countries.COUNTRIES_AND_TERRITORIES.
Of course we needed a special function named updateStudentsPerCountry, and we could easily set those parameters there. We could live with that, but now, let us say, we want to add "Mentors Per Country". Previously, we needed to:
1) add an appropriate entity to the data model
2) add an updateMentorsPerCountry function
3) set logic to mentor_logic
4) set choices to soc.models.countries.COUNTRIES_AND_TERRITORIES
5) redeploy Melange.
A lot, isn't it?
Now all that stuff is done by parsing instructions.
Let us take a look at instructions for students_per_country:
instructions = {
"params": {"fields": ["res_country"]},
"field": "country",
"type": "per_field",
"logic": "student"
}
The most important field is "type", because it tells us that we are dealing with a "per field" statistic. Actually, all statistics before last week were of the "per field" kind.
Then we have "logic", which says which logic will be used to iterate through the entities. To get the actual logic, we look up the value in the logics_dict dictionary.
What we still need is the choices. So we have the "field" parameter, and we look up the choices list in the choices_dict dictionary.
Last but not least, there is "params". It is a dictionary which is passed to the collectStat function. Previously it was also set by statistic-specific functions.
And basically that is all. Let us consider what we have to do now in order to add "Mentors Per Country":
1) Add an entity to the data model with the same parameters as for "Students Per Country", but with "logic": "mentor" instead.
And that is all! No changes to the source code are necessary (I assume that we already have a value for "mentor" in logics_dict).
I hope you will agree with me that it is simpler now :-)
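The lookup steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the contents of logics_dict and choices_dict, the plain-dict entities, and the functions collect_stat and collect_from_instructions are all hypothetical stand-ins, not the real Melange logic or collectStat.

```python
import json

# Hypothetical registries: in Melange these would map to real logic objects
# and to lists such as COUNTRIES_AND_TERRITORIES.
logics_dict = {
    "student": [
        {"res_country": "Poland"},
        {"res_country": "Poland"},
        {"res_country": "United States"},
    ],
    "mentor": [{"res_country": "Norway"}],
}
choices_dict = {"country": ["Norway", "Poland", "United States"]}


def collect_stat(entities, choices, params):
    """Count entities per choice, using the first field named in params."""
    field = params["fields"][0]
    counts = {choice: 0 for choice in choices}
    for entity in entities:
        value = entity.get(field)
        if value in counts:
            counts[value] += 1
    return counts


def collect_from_instructions(instructions_json):
    """Resolve logic, choices and params from the JSON instructions."""
    instructions = json.loads(instructions_json)
    if instructions["type"] != "per_field":
        raise ValueError("unsupported type: " + instructions["type"])
    entities = logics_dict[instructions["logic"]]
    choices = choices_dict[instructions["field"]]
    return collect_stat(entities, choices, instructions["params"])


students_per_country = json.dumps({
    "params": {"fields": ["res_country"]},
    "field": "country",
    "type": "per_field",
    "logic": "student",
})
# "Mentors Per Country" only differs in the "logic" value; no new function.
mentors_per_country = students_per_country.replace('"student"', '"mentor"')

student_counts = collect_from_instructions(students_per_country)
mentor_counts = collect_from_instructions(mentors_per_country)
```

In this sketch the new statistic is pure data: the second instructions string is stored with the entity, and the same code path handles both.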

The next thing I worked on during the last week was dealing with statistics which have no fixed list of choices. Actually I had worked on that some time before, so I had some concepts, but because of the switch to instructions, I had to make some changes. Here is how it is done now.

Let us say we want to have a Student Proposals Per Organization statistic. Before we start to iterate through student proposals, we need a list of all organizations. So we just iterate through all organization entities (also in batches) and create a list of their link_ids.

The question is how we know whether a statistic needs its list of choices collected dynamically, and what to collect. The answer is: from the instructions, of course ;-) The only effort is to add a "choices_logic" parameter to the dictionary.
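A minimal sketch of how "choices_logic" could switch between a fixed and a dynamically collected list of choices. The helper names, the batch size and the plain-dict entities are assumptions for illustration; only "choices_logic", "field" and link_id come from the description above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def resolve_choices(instructions, choices_dict, logics_dict, batch_size=2):
    """Return fixed choices, or collect link_ids when choices_logic is set."""
    choices_logic = instructions.get("choices_logic")
    if choices_logic is None:
        # Fixed list, e.g. countries.
        return choices_dict[instructions["field"]]
    link_ids = []
    # Walk the entities batch by batch, as the statistic collection does.
    for batch in iter_batches(logics_dict[choices_logic], batch_size):
        link_ids.extend(entity["link_id"] for entity in batch)
    return link_ids


# Hypothetical data standing in for organization entities.
example_logics = {
    "organization": [
        {"link_id": "org_a"},
        {"link_id": "org_b"},
        {"link_id": "org_c"},
    ],
}
example_choices = {"country": ["Norway", "Poland"]}

dynamic_choices = resolve_choices(
    {"choices_logic": "organization"}, example_choices, example_logics)
fixed_choices = resolve_choices(
    {"field": "country"}, example_choices, example_logics)
```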

As I said, I will try to provide more information about the rest of the parameters soon on the wiki. The most important one is "checker", which allows filtering the iterated entities by some criteria. For example, we can process only those students who have a project assigned.
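One way a "checker" could work is as a name in the instructions that maps to a predicate function. The sketch below assumes exactly that; checkers_dict, the checker names and the "project" key are hypothetical and not taken from Melange.

```python
# Hypothetical registry of predicates, keyed by the name used in instructions.
checkers_dict = {
    "has_project": lambda entity: entity.get("project") is not None,
    "has_no_project": lambda entity: entity.get("project") is None,
}


def filter_entities(entities, instructions):
    """Keep only entities accepted by the checker named in instructions."""
    checker_name = instructions.get("checker")
    if checker_name is None:
        # No checker: every iterated entity is processed.
        return list(entities)
    checker = checkers_dict[checker_name]
    return [entity for entity in entities if checker(entity)]


students = [
    {"link_id": "alice", "project": "stats"},
    {"link_id": "bob", "project": None},
]
with_project = filter_entities(students, {"checker": "has_project"})
without_project = filter_entities(students, {"checker": "has_no_project"})
```

With a pair of checkers like this, a "with project" and a "without project" statistic can share everything except one value in their instructions.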

Some time ago Pawel and Sverre had a meeting about the final goals for the Statistic Module project. They put together a list of statistics which they would really like to have. Some of the statistics were already present, but some were not. So I also worked on them during the last week.
The newly available statistics are:
* Mentors With Project Per Country
* Mentors Without Project Per Country
* Organizations Per Program
* Student Projects Per Country
* Student Projects Per Continent
* Student Proposals Per Country
* Student Proposals Per Continent
* Students With Project Per Country
* Students Without Project Per Country
* Students Per Graduation Year
* Students With Project Per Graduation Year
* Students Without Project Per Graduation Year
Note: As I mentioned above, one of the instruction parameters is "checker", which makes it possible to collect all those "with/without" statistics.
* Number of Students
* Number of Mentors
* Number of Student Proposals
* Number of Student Projects
* Number of Organization Admins
* Number of Mentors With Projects
* Number of Students With Projects
* Number of Students With Proposals
The last batch of statistics is, let us say, "per nothing". I mean, of course we could have them "per program", but do we really want to?
So I put them in one single entity, "Gsoc2009 Overall". Its "type" in the instructions is "overall", and such a statistic consists of many substatistics. They also use instructions, so it is quite easy to add new ones.

Currently, the only small statistics that are supported have "type" "number", so now I am going to add another kind, "average", because there are still two statistics left on our mentors' wish list:
* Average number of projects per mentor
* Average number of student proposals per student
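As a rough idea of how an "overall" statistic with "number" and "average" substatistics might be evaluated, here is a small sketch. The layout of the instructions (the "substatistics" and "per_logic" keys) and the use of plain lists as logics are assumptions; the "average" kind in particular is only planned above, so this is one possible reading, not the implementation.

```python
def collect_overall(instructions, logics_dict):
    """Evaluate every substatistic of an 'overall' statistic."""
    results = {}
    for name, sub in instructions["substatistics"].items():
        count = len(logics_dict[sub["logic"]])
        if sub["type"] == "number":
            results[name] = count
        elif sub["type"] == "average":
            # Average of one entity kind per entity of another kind.
            per_count = len(logics_dict[sub["per_logic"]])
            results[name] = count / per_count if per_count else 0.0
        else:
            raise ValueError("unsupported type: " + sub["type"])
    return results


# Hypothetical entity lists; only their sizes matter here.
overall_logics = {
    "student": ["s1", "s2", "s3", "s4"],
    "mentor": ["m1", "m2"],
    "student_project": ["p1", "p2", "p3"],
}
overall_instructions = {
    "type": "overall",
    "substatistics": {
        "Number of Students": {"type": "number", "logic": "student"},
        "Average number of projects per mentor": {
            "type": "average",
            "logic": "student_project",
            "per_logic": "mentor",
        },
    },
}
overall = collect_overall(overall_instructions, overall_logics)
```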

As I said, I worked almost entirely on the backend; the only thing I did for the client side was reducing the number of columns to 2, as the mentors suggested.

One last thing: last week I sent a first batch of patches. Lennie and Sverre, thank you for the reviews. I will take them into account and will try to send new ones by Wednesday.
