Project Offerings


Project Offerings for Semester 1, 2015

For 2015, I am offering the following projects on large-scale distributed data management, next-generation data analytics and data management for bio-data.

If you are interested in any of these projects, please contact me by email or in person.

 

Projects on Next-Generation Data Analytics

Data Privacy Analysis of Health Tracking Services

In recent years, personal health tracking services have become very popular. These systems collect data from personal health sensors, such as health bands, step counters or smart watches, and provide health data analysis via graphical user interfaces. Some services additionally integrate social networking functionality, for example to share experiences or to provide additional motivation by comparing one's own health habits with those of peers. The underlying data processing infrastructure is typically cloud-based: data is collected locally and then sent to a central service hosted in cloud data centers, where the processing, sharing and visualisation are done.

The goal of this project is to compare popular health tracking services with regard to their processing infrastructure from the point of view of data privacy: How is the data collected, where is it processed, and is any data disclosed to other people or even organisations?

For students interested in taking on this project as a research project, this task can be extended to include the design of a distributed health tracking service with guaranteed data privacy and anonymization functionalities.

A Touch Interface for SQL Databases (Honours project)

More and more computing systems are produced with touch interfaces, from smartphones and tablets to the latest versions of desktop operating systems (Windows 8 and Mac OS X). At the same time, the basic interface to database systems is still SQL, a text-based query language that requires keyboard input and is hard for novice users to learn.

In our TouchQL project, we aim to develop a query 'language' that is based purely on a graphical schema representation and input gestures, and that allows users to query a relational database using a tablet computer.

An initial prototype of TouchQL already exists for Android devices. It supports basic selections, projections and natural joins over local databases.

The goal of this Honours project is to extend this system with a mechanism for grouping and aggregation, and to support querying remote databases. The challenge in the latter part is to provide timely feedback to the user, because TouchQL does not separate query formulation from query execution: users get immediate feedback on their intended actions on the actual data set. It would be an additional benefit if the student were able to port TouchQL from the Java-based Android platform to the Objective-C-based iOS.
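To make the target of this extension concrete, the following minimal sketch shows the kind of SQL grouping and aggregation query that a TouchQL gesture would have to express. It uses Python's built-in sqlite3 module; the `orders` table, its columns and the sample data are assumptions made up for illustration and are not part of the TouchQL prototype.

```python
import sqlite3


def total_per_customer(rows):
    """Return (customer, order_count, total_amount) per customer, sorted by customer.

    `rows` is a list of (customer, amount) tuples for a hypothetical orders table.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # Grouping (GROUP BY) plus two aggregates (COUNT, SUM) per group.
        cursor = conn.execute(
            "SELECT customer, COUNT(*), SUM(amount) "
            "FROM orders GROUP BY customer ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
result = total_per_customer(sample)
```

A gesture-based interface would need some way to pick the grouping column and the aggregate function; how TouchQL should do that is the open design question of the project.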

 

Projects on Large-Scale Distributed Data Management:

MongoDB with Transactional Memory

An interesting recent development for server-class, multi-core CPUs is hardware transactional memory, which allows a CPU to execute short code sections with transactional guarantees: memory changes are kept only if the whole code section executes without conflict with parallel threads; otherwise, program execution is transparently reset to the start of the transactional code and any changes made so far are discarded. This is especially beneficial for the efficient execution of critical sections in multi-threaded programs.
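The commit-or-retry behaviour described above can be imitated in software. The sketch below is only a simulation of those semantics using a version counter; it is not hardware transactional memory and does not use Intel's instruction set extensions. The class `VersionedCell`, its method `transact` and the retry limit are hypothetical names chosen for this illustration. After too many aborts it falls back to running the update under a lock, which is an assumption about a sensible fallback, not a description of any particular system.

```python
import threading


class VersionedCell:
    """A shared value with a version counter used to detect conflicting updates."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # guards only the short commit step

    def transact(self, update, max_retries=100):
        """Apply update(old) -> new optimistically and return the number of aborts."""
        for aborts in range(max_retries):
            seen_version = self.version      # remember the state we started from
            new_value = update(self.value)   # tentative result, not yet visible
            with self._commit_lock:
                if self.version == seen_version:
                    # No other thread committed in the meantime: keep the change.
                    self.value = new_value
                    self.version += 1
                    return aborts
            # Conflict: discard new_value and restart from the beginning.
        # Fallback path: run the update pessimistically under the lock.
        with self._commit_lock:
            self.value = update(self.value)
            self.version += 1
        return max_retries


def run_demo(threads=4, increments=500):
    """Increment one shared counter from several threads and return the final value."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            cell.transact(lambda old: old + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

Calling `run_demo()` with the defaults performs 4 x 500 increments, so no update is lost if the final value is 2000.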

In a previous project, we investigated the core execution characteristics of Intel's hardware transactional memory for MySQL. The goal of this project is to extend this study to a popular NoSQL database, such as MongoDB. We are interested in identifying code sections that tend to become performance bottlenecks once the core count of a CPU gets large enough, because all threads but one have to wait for extended periods to enter the critical section (blocking mutex). These sections shall then be modified to use the hardware transactional memory extensions, and the resulting performance changes shall be evaluated.

Data Processing on MultiCore Machines: Building a Virtual Database Cluster (Honours project)

Multi-core computers are becoming increasingly common for large servers. At the end of this year, server CPUs with 128 cores will become available. This poses a real challenge to database engines, as they are optimised for concurrent workloads that share resources and hide latency, rather than for large numbers of parallel cores that can run many queries completely independently. There is a body of work on optimising databases for distributed systems such as database clusters. In this project, we are interested in learning how those techniques perform when applied to a single multi-core machine that is configured as a 'virtual' cluster by deploying several virtual machines on the same hardware. To this end, we have both a large multi-core machine and a small research cluster available as hardware platforms. The project student shall compare the performance of an open source DBMS on both platforms for a given workload and develop a new load distribution technique that optimises the performance of the virtual database cluster.

Database Cluster Management Tool (Software Development: MIT 12cp / TSP / Engineering Project / Undergraduate project for 1-2 students)

We have a small database cluster of 8 nodes which we use for several research projects. It is a multi-boot cluster (Linux and Windows 2003 Server) that can run different database engines, both commercial and open source, such as Oracle, Microsoft SQL Server, and PostgreSQL. We need a platform-independent monitoring tool with a GUI that helps us (a) keep track of the current cluster state and (b) reboot cluster nodes into different configurations. Ideally, it would also include a cluster allocation component to manage our research projects and allow us to use subsets of the cluster concurrently in different projects. This project shall review existing database cluster management tools and set up a suitable solution, possibly extended with self-developed software components.

Skills needed: Some experience with programming and databases; a system administration background is an advantage

Suitable majors: Databases, Software Engineering, Networking

PowerDB: Freshness-aware Replication in a Database Cluster (12-18cp MIT Project or TSP project)

This project aims to set up a freshness-aware replication engine for a cluster of databases. It will be based on an existing cluster coordinator called PowerDB that is written in C++ and optimised for SQL Server. The student shall install, configure and optimise this version on our new database cluster running PostgreSQL. The 18cp version of the project will additionally run performance and scalability tests on the new system.
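As a rough illustration of what freshness-aware routing can mean, the sketch below sends a read query to the least-loaded replica that is not too far behind the primary. It is a simplified model under stated assumptions, not PowerDB's actual algorithm or interface: the `Replica` class, the version counters and the `max_staleness` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    applied_version: int      # last primary update this replica has applied
    active_queries: int = 0   # current load, used to break ties


def route_read(replicas: List[Replica], primary_version: int,
               max_staleness: int) -> Optional[Replica]:
    """Pick the least-loaded replica that lags at most max_staleness updates.

    Returns None when no replica is fresh enough; a caller could then wait
    for replicas to catch up or run the query on the primary instead.
    """
    fresh_enough = [
        r for r in replicas
        if primary_version - r.applied_version <= max_staleness
    ]
    if not fresh_enough:
        return None
    chosen = min(fresh_enough, key=lambda r: r.active_queries)
    chosen.active_queries += 1  # account for the query we are about to send
    return chosen


def demo_replicas() -> List[Replica]:
    """Three example replicas with different lag and load."""
    return [Replica("n1", 100, 3), Replica("n2", 95, 0), Replica("n3", 99, 1)]
```

With the primary at version 100, a query that tolerates a lag of one update can only use n1 or n3, while a query that tolerates a lag of ten can also use the idle but staler n2.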

Skills needed: Good knowledge in C++ and in databases

Suitable majors: Databases, Software Engineering, Computer Science

 

Projects on Database Support for Bioinformatics:

Bio-Data Processing using Map/Reduce (Honours project or 18cp MIT Research Project)

This project will investigate the suitability of a map/reduce framework for the parallel processing of DNA fragment data (so-called 'short reads'). The student shall implement a short-read comparison algorithm on the database research cluster of the DBRG using the open source Hadoop system.
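The map/reduce style of processing can be sketched on a single machine without Hadoop. The example below is one possible way to compare short reads, namely counting how many distinct k-mers (substrings of length k) each pair of reads shares. This choice of comparison, the function names and the sample reads are assumptions for illustration; the project does not prescribe a particular algorithm, and the code does not use the Hadoop API.

```python
from collections import defaultdict
from itertools import combinations


def map_read(read_id, sequence, k):
    """Map step: emit one (k-mer, read_id) pair per substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], read_id


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_kmer(kmer, read_ids):
    """Reduce step: each pair of reads sharing this k-mer gets one vote."""
    for a, b in combinations(sorted(read_ids), 2):
        yield (a, b), 1


def shared_kmer_counts(reads, k):
    """Return {(read_a, read_b): number of distinct k-mers the two reads share}."""
    emitted = (pair for rid, seq in reads.items() for pair in map_read(rid, seq, k))
    counts = defaultdict(int)
    for kmer, ids in shuffle(emitted).items():
        # Summing the votes per read pair would be a second reduce phase
        # in a real map/reduce job; here it is done inline.
        for pair, vote in reduce_kmer(kmer, ids):
            counts[pair] += vote
    return dict(counts)


reads = {"r1": "ACGTAC", "r2": "CGTACG", "r3": "TTTTTT"}
similar = shared_kmer_counts(reads, k=3)
```

Because each map call looks at one read only and each reduce call at one k-mer only, both steps can run in parallel across cluster nodes, which is the property the project sets out to evaluate.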

Skills needed: INFO2x20, COMP5138 or equivalent database course (INFO3404 would be perfect); good programming skills; Bioinformatics background not necessary

Suitable MIT majors: Databases, Software Engineering, Computer Science

I also maintain a list of former projects supervised by me in recent years.