Provlet: A Data Provenance and Group Access Control Solution for Scientific Lab Environments

Introduction.

Cyberinfrastructures are technological and sociological solutions to the problem of efficiently connecting laboratories, universities, data, computers, and people, with the goal of enabling the derivation of novel scientific theories and knowledge. Each cyberinfrastructure must address data acquisition, storage, management, integration, mining, and visualization, along with other computing and information processing services that are distributed over the Internet and extend beyond a single institution. 4CeeD is a solution that addresses these considerations and is designed for different scientific lab environments.

What’s the problem? 

Available cyberinfrastructures are vulnerable to data manipulation attacks that target the original data, its corresponding metadata stored in the database, or both. Scientists must be able to trust the experimental results and the associated data and metadata, so increasing the trustworthiness of stored data is extremely important. It is also critical to determine the source of a given piece of data, because data is vulnerable to insertion attacks mounted by different parties during transmission and processing. A further requirement is the ability to recover and recreate data that has been removed from the database, whether intentionally or accidentally.

What’s our solution?

Data provenance can address these problems by recording the history of how data was generated and processed. It provides audit trails and helps trace the source and cause of any problem. Logging every transaction also increases data trustworthiness. Lost data can be recreated by following the recorded recipe. Finally, provenance data lets the system learn how data was derived, which can be used for further exploration.
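To make the idea of a recipe concrete, here is a minimal sketch in Python. It assumes each processing step is recorded as an output, a process name, and a list of inputs. The names ProvenanceRecord and lineage, and the sample data, are hypothetical and are not part of 4CeeD or Clowder.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """One derivation step: which process turned which inputs into an output."""
    output_id: str
    process: str
    input_ids: Tuple[str, ...] = ()


def lineage(records: Dict[str, ProvenanceRecord], data_id: str) -> List[ProvenanceRecord]:
    """Return the steps needed to recreate data_id, ordered from sources to result."""
    ordered: List[ProvenanceRecord] = []
    seen = set()

    def visit(current: str) -> None:
        # Items without a record are treated as raw sources and end the walk.
        if current in seen or current not in records:
            return
        seen.add(current)
        record = records[current]
        for parent in record.input_ids:
            visit(parent)
        ordered.append(record)  # appended only after all of its inputs

    visit(data_id)
    return ordered


# Hypothetical history of one experiment.
records = {
    r.output_id: r
    for r in [
        ProvenanceRecord("calibrated", "calibrate", ("raw_scan",)),
        ProvenanceRecord("filtered", "denoise", ("calibrated",)),
        ProvenanceRecord("figure", "plot", ("filtered", "calibrated")),
    ]
}

recipe = [step.process for step in lineage(records, "figure")]
print(recipe)
```

Replaying the returned steps in order would recreate a lost item, provided the raw sources and the processes themselves are still available.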

The PROVLET project aims to introduce a new service that provides data provenance and group access control to 4CeeD and its underlying data management system, Clowder. The solution not only defines provenance-based security but also secures the provenance data itself. It ensures that all data and processes are securely accessed, tracked, and archived in the system.
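One common way to secure provenance data itself is a hash chain, in which every log entry commits to the entry before it. The sketch below illustrates that general technique and is not a description of how PROVLET is implemented. The entry layout and the function names append_entry and verify are assumptions made for this example.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: List[Dict], event: Dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict]) -> bool:
    """Recompute every link; an edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical events.
log: List[Dict] = []
append_entry(log, {"user": "alice", "action": "upload", "dataset": "ds-1"})
append_entry(log, {"user": "bob", "action": "view", "dataset": "ds-1"})
intact = verify(log)

log[0]["event"]["user"] = "mallory"  # simulate tampering with a stored entry
tampered_detected = not verify(log)
```

A chain like this only makes tampering evident. An attacker who can rewrite the whole log can also recompute every hash, so in practice the latest hash would need to be signed or stored somewhere the attacker cannot reach.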

Challenges.

Providing a data provenance solution raises several challenges that we try to address in this project: collecting the provenance data, processing it, storing the processed logs on disk, securing the log files, retrieving the logs from the database, and analyzing them.

Areas of Consideration.

In PROVLET, we define a service that collects the logs and uses a lossless compression algorithm to compress the log files before storing them on disk. We also take latency into account and aim to introduce a fast retrieval algorithm that reads the compressed log files and searches the stored provenance data efficiently. We will monitor access patterns in order to learn how to log transactions and how to manage system resources efficiently. We will also introduce an algorithm that analyzes the log files to learn and detect group behaviors, which can be used in applications such as anomaly detection and reporting of suspicious data accesses.
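The trade-off between lossless compression and fast retrieval can be illustrated with a small sketch. It assumes log events carry an integer timestamp field named ts and arrive in time order. Events are compressed in blocks with Python's standard zlib module, and each block keeps its timestamp range so that a search decompresses only the blocks that may match. The block layout and the names compress_blocks and search are hypothetical, and PROVLET's actual compression and retrieval algorithms may differ.

```python
import json
import zlib
from typing import Dict, Iterator, List, Tuple

Block = Tuple[int, int, bytes]  # (first_ts, last_ts, compressed JSON lines)


def compress_blocks(events: List[Dict], block_size: int = 2) -> List[Block]:
    """Compress time-ordered events in fixed-size blocks, keeping each block's time range."""
    blocks: List[Block] = []
    for start in range(0, len(events), block_size):
        chunk = events[start:start + block_size]
        raw = "\n".join(json.dumps(e, sort_keys=True) for e in chunk).encode("utf-8")
        blocks.append((chunk[0]["ts"], chunk[-1]["ts"], zlib.compress(raw, level=9)))
    return blocks


def search(blocks: List[Block], t_from: int, t_to: int) -> Iterator[Dict]:
    """Yield events with t_from <= ts <= t_to, skipping blocks outside the range."""
    for first_ts, last_ts, payload in blocks:
        if last_ts < t_from or first_ts > t_to:
            continue  # the range index lets us skip this block without decompressing it
        for line in zlib.decompress(payload).decode("utf-8").splitlines():
            event = json.loads(line)
            if t_from <= event["ts"] <= t_to:
                yield event


# Hypothetical access events.
events = [
    {"ts": 100, "user": "alice", "action": "upload"},
    {"ts": 105, "user": "bob", "action": "view"},
    {"ts": 110, "user": "alice", "action": "edit"},
    {"ts": 120, "user": "carol", "action": "view"},
    {"ts": 130, "user": "bob", "action": "delete"},
]
blocks = compress_blocks(events)
hits = list(search(blocks, 112, 130))
```

Larger blocks usually compress better but force more decompression per query, so the block size is a tuning parameter rather than a fixed choice.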

Provenance goes hand in hand with access control, and especially with group access control, where tracking operations over datasets and spaces is important. We will keep track of access to datasets (not files) and to spaces (not folders), which are the data-related abstractions that Clowder uses to organize data. This is a new angle compared with provenance over files and folders of data.
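As a rough illustration of how group access control and provenance tracking can be combined at the level of datasets and spaces, consider the sketch below. It does not use Clowder's API or its real permission model. The classes Space and AccessMonitor, the permission layout, and all identifiers are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Space:
    """A hypothetical space: a set of datasets plus per-group allowed actions."""
    space_id: str
    datasets: Set[str] = field(default_factory=set)
    permissions: Dict[str, Set[str]] = field(default_factory=dict)  # group -> actions


@dataclass
class AccessMonitor:
    """Checks group permissions and records every attempt, allowed or denied."""
    spaces: Dict[str, Space]
    memberships: Dict[str, Set[str]]  # user -> groups
    events: List[Dict] = field(default_factory=list)

    def access(self, user: str, action: str, space_id: str, dataset_id: str) -> bool:
        space = self.spaces[space_id]
        groups = self.memberships.get(user, set())
        allowed = dataset_id in space.datasets and any(
            action in space.permissions.get(group, set()) for group in groups
        )
        # The provenance event names the dataset and space, not files or folders.
        self.events.append({
            "user": user,
            "groups": sorted(groups),
            "action": action,
            "space": space_id,
            "dataset": dataset_id,
            "allowed": allowed,
        })
        return allowed


# Hypothetical lab setup.
lab = Space(
    "materials-lab",
    datasets={"ds-1"},
    permissions={"researchers": {"view", "edit"}, "students": {"view"}},
)
monitor = AccessMonitor(
    spaces={"materials-lab": lab},
    memberships={"alice": {"researchers"}, "bob": {"students"}},
)

alice_edit = monitor.access("alice", "edit", "materials-lab", "ds-1")
bob_edit = monitor.access("bob", "edit", "materials-lab", "ds-1")
denied = [e for e in monitor.events if not e["allowed"]]
```

Recording denied attempts together with the groups involved is what would later allow an analysis step to learn group behavior and flag suspicious accesses.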


This research is funded by the National Science Foundation, NSF ACI 1835834, project title “Collaborative Research: CSSI: Framework: Data: Clowder Open Source Customizable Research Data Management, Plus-Plus.” Any results and opinions are our own and do not represent the views of the National Science Foundation.