Friday, July 30, 2010

GSoC 2010 Data Seeder - Seeding using MapReduce framework

The seeder is now fully functional, running on a live instance, and should hopefully be able to seed any provided configuration sheet. The final implementation uses the mapreduce framework[0] for GAE and is highly parallelizable.
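To give an idea of the structure, each map call only has to build and store one entity. The sketch below is illustrative only: the Linkable model fields, the seed_linkable handler and the shape of a configuration sheet entry are all hypothetical, and it assumes the usual GAE mapreduce convention of one input item per map call.

    from google.appengine.ext import db


    class Linkable(db.Model):
        """Hypothetical stand-in for the seeded model."""
        name = db.StringProperty()
        scope = db.SelfReferenceProperty(collection_name='children')


    def seed_linkable(config_entry):
        """Map handler: builds and stores one entity per call.

        config_entry stands for whatever the input reader yields for a
        single row of the configuration sheet.
        """
        entity = Linkable(name=config_entry.get('name', 'seeded'))
        # Direct put: one datastore write per map call.
        db.put(entity)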

I also did some benchmarking on the cloud. I tried seeding 10,100 simple Linkable entities, each including a string and a reference property. Running with the task queue at 50 tasks/second and the mapreduce framework set to process 100 tasks/second (the default value), seeding finished successfully after around 1 minute 40 seconds. That's pretty fast in my opinion, and it consumed about 0.45 CPU hours in the cloud. This is possible because mapreduce runs multiple map operations in a single task. Perhaps even better speed/efficiency can be achieved by using mutation pools to do batch updates in the datastore.
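For a sense of what the mutation pool optimization would look like: instead of calling db.put() directly, a map handler can yield a datastore operation and let the framework batch the writes. A minimal sketch, reusing the hypothetical handler from above:

    from mapreduce import operation as op


    def seed_linkable(config_entry):
        """Same handler as above, but writes go through the mutation pool."""
        entity = Linkable(name=config_entry.get('name', 'seeded'))
        # Yielding the operation lets the mapreduce framework collect puts
        # into a mutation pool and flush them to the datastore in batches,
        # instead of issuing one RPC per entity.
        yield op.db.Put(entity)
        yield op.counters.Increment('linkables-seeded')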

The current solution handles related models (back-referencing) by starting a new mapreduce job for each of them. I believe this is also faster and easier to implement; the downside is that a lot of jobs show up in the status interface. I think this could be fixed by using some sort of smart sharding (already supported in mapreduce) instead of starting new jobs; I'll look into it.
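Starting such a follow-up job programmatically is straightforward with the framework's control module. The sketch below is only an illustration (the handler path and job name are hypothetical), and the shard_count parameter is roughly where smarter sharding could later replace the extra jobs:

    from mapreduce import control


    def start_back_reference_job(model_kind):
        """Starts a separate mapreduce job for one back-referenced model.

        model_kind and the handler/reader paths are hypothetical.
        """
        return control.start_map(
            name='Seed back-references for %s' % model_kind,
            handler_spec='seeder.seed_linkable',
            reader_spec='mapreduce.input_readers.DatastoreInputReader',
            mapper_parameters={'entity_kind': model_kind},
            shard_count=4)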

So now that the seeder is working, it's time to get back to working on the web interface. During the next week I will add the ability to import an existing configuration sheet, better navigation options, and better control when configuring references (like being able to specify a subclass of a specific model, etc.). If time permits, I'll also try to add some styling.
