Develop a backend and API for a scientific-data publishing platform

Price negotiable
October 2, 2022, 15:43 • 16 responses • 107 views
Note: the task description below is in English. Please include your résumé and a cost estimate in your response to this listing.

General project description: We are working on a publishing platform for scientific data. Our users should be able to publish datasets with descriptions and group them, search through published entities, download them, and receive notifications about new data on topics of interest.

Similar projects to look up:

https://zenodo.org/

https://figshare.com/

Current state: We plan to split the work into two stages:

1. Writing the backend and API for the publishing platform.

2. Writing the user interface.

This task is for the backend and API.

So far we have defined the API requirements and their basic structure. We haven’t written any code yet, so you have carte blanche to suggest technologies, and you won’t have to deal with any legacy code :)

Publishing API & Backend: Main functionality: Data publishing - uploading data with a description. We want to support two general cases (sketched below):

1. Upload files directly.

2. Provide links to files that are hosted elsewhere.
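Since the stack is open, here is a minimal sketch of what the two publishing paths could look like. Python with FastAPI is used purely as an illustration, and every endpoint and field name below is an assumption, not a spec:

```python
# Sketch only: two publishing paths, assuming a Python/FastAPI stack.
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel, HttpUrl

app = FastAPI()

class LinkedDataset(BaseModel):
    """Case 2: files are hosted elsewhere; we store only the URLs."""
    title: str
    description: str
    file_urls: list[HttpUrl]

@app.post("/datasets/upload")
async def upload_dataset(title: str, description: str, files: list[UploadFile]):
    # Case 1: files are uploaded directly and stored by the platform.
    stored = [f.filename for f in files]  # placeholder for real blob storage
    return {"title": title, "files": stored}

@app.post("/datasets/link")
async def link_dataset(dataset: LinkedDataset):
    # Case 2: only links are registered; files remain at their origin.
    return {"title": dataset.title, "linked_files": len(dataset.file_urls)}
```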

Basic types that we want to support:

1. Tables: CSV, TSV, Microsoft Excel (.xls, .xlsx); nice to have: OpenDocument/OpenOffice

2. Text-based: TXT, JSON

3. Images: common formats

4. Word documents

5. PDF

6. Dataset group - a group of datasets within the system.

In the future this list may grow, so the proposed system design should make it easy to introduce new formats.
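One common way to keep formats pluggable is a handler registry keyed by file extension, so adding a format means one new registration. A rough Python sketch, with all names hypothetical:

```python
# Sketch of an extensible format registry; adding a format = one new handler class.
from typing import Callable, Protocol

class FormatHandler(Protocol):
    def extract_preview_text(self, raw: bytes) -> str: ...

FORMAT_HANDLERS: dict[str, FormatHandler] = {}

def register_format(*extensions: str) -> Callable:
    """Decorator that maps file extensions to a handler instance."""
    def wrap(cls):
        handler = cls()
        for ext in extensions:
            FORMAT_HANDLERS[ext] = handler
        return cls
    return wrap

@register_format(".csv", ".tsv")
class TableHandler:
    def extract_preview_text(self, raw: bytes) -> str:
        # e.g. return the first rows for full-text indexing
        return raw.decode("utf-8", errors="replace")

@register_format(".txt", ".json")
class TextHandler:
    def extract_preview_text(self, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")
```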

As mentioned above, we also want to allow publishing groups of datasets to organize hierarchical relationships. Each dataset should carry some metadata: author, date of publication, tags, version, links.
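For illustration, that metadata might map onto a record like the following; the field names and types are our assumptions:

```python
# Sketch of per-dataset metadata; field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class DatasetMetadata:
    author: str
    published_at: datetime
    tags: list[str] = field(default_factory=list)
    version: int = 1
    links: list[str] = field(default_factory=list)  # related entities / external links
    parent_group_id: Optional[str] = None           # hierarchical grouping via dataset groups
```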

Data storage - assigning IDs to published entities, retrieving them by ID, and updating them. A key storage requirement is that it must be efficient for various dataset search queries: search by type, full-text search, search by publisher, and search for linked data (including non-direct neighbors).

We should also be able to update entity parameters such as tags and links.

We also want to keep all versions: on update, we take the version from the update request (or auto-increment if none is given) and store the new version alongside the previous ones. Naturally, we also need to be able to retrieve previous versions.
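A minimal sketch of that versioning rule, using an in-memory stand-in for the real storage (all names hypothetical): an explicit version in the request wins, otherwise previous version + 1, and every version stays retrievable.

```python
# Sketch of the versioning rule: explicit version from the request, or auto-increment.
from typing import Optional

class VersionStore:
    def __init__(self):
        # entity_id -> {version: payload}; all versions are retained
        self._versions: dict[str, dict[int, dict]] = {}

    def update(self, entity_id: str, payload: dict, version: Optional[int] = None) -> int:
        history = self._versions.setdefault(entity_id, {})
        new_version = version if version is not None else max(history, default=0) + 1
        history[new_version] = payload
        return new_version

    def get(self, entity_id: str, version: Optional[int] = None) -> dict:
        # No version given -> return the latest one.
        history = self._versions[entity_id]
        return history[version if version is not None else max(history)]
```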

Data search - running a search query and returning the results. As mentioned earlier, we want to support several types of search:

1. We want to allow searching on strict conditions: e.g., "search only type: .csv" returns only CSV files.

2. We want to support full-text search with ranking: e.g., "search for: pasta dataset" returns a ranked list where the top results are datasets whose headings contain "pasta", followed by datasets mentioning related terms such as "spaghetti".

3. We want to combine conditions, e.g. (see the sketch below):

"search only type: .csv" && "search for: pasta dataset" returns ranked pasta datasets in .csv format only.
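To make the combination concrete: in an engine with a boolean query language (Elasticsearch is one candidate, not a requirement), strict conditions become non-scoring filters while the full-text part drives the ranking. The field names below are assumptions:

```python
# Sketch of the combined query as an Elasticsearch-style bool query.
from typing import Optional

def build_query(full_text: str, file_type: Optional[str] = None) -> dict:
    must = [{"match": {"content": full_text}}]              # scored: drives the ranking
    filters = []
    if file_type:
        filters.append({"term": {"file_type": file_type}})  # strict: yes/no only
    return {"query": {"bool": {"must": must, "filter": filters}}}

# "search only type: .csv" && "search for: pasta dataset"
query = build_query("pasta dataset", file_type="csv")
```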

Our basic ideas:

1. Search in the dataset description

2. Search in the first N rows of the dataset

3. Search by author and tags

4. Search within a time range
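Taken together, ideas 1-2 feed full-text fields of the indexed document while ideas 3-4 feed exact-match and range fields. A sketch with assumed field names:

```python
# Sketch of the search document implied by the four ideas above.
PREVIEW_ROWS = 20  # "first N rows" - N is a tuning parameter

def to_search_document(meta: dict, rows: list[str]) -> dict:
    return {
        "description": meta["description"],         # idea 1: full-text field
        "preview": "\n".join(rows[:PREVIEW_ROWS]),   # idea 2: first N rows, full-text
        "author": meta["author"],                    # idea 3: exact-match fields
        "tags": meta.get("tags", []),
        "published_at": meta["published_at"],        # idea 4: range queries over time
    }
```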

Notifications - subscribing to a search query that is run periodically, with updates sent to the user. In the future we want to notify users about new datasets found for their queries, so we need to be able to create a request that stores a task to periodically re-run a user-defined search.

This is not urgent right now, but the functionality should be kept in mind while designing the solution.
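A sketch of what "storing the task" could mean at the data level, leaving the actual scheduler (cron, Celery beat, etc.) open; all names here are assumptions:

```python
# Sketch of a saved-search record plus a runner that reports only new results.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

@dataclass
class SavedSearch:
    user_id: str
    query: dict                      # e.g. the output of build_query() above
    interval: timedelta              # how often the scheduler should re-run it
    last_run: Optional[datetime] = None
    last_seen_ids: frozenset = frozenset()

def run_saved_search(search: SavedSearch,
                     execute_query: Callable[[dict], list[str]]) -> list[str]:
    """Return only the dataset IDs that are new since the previous run."""
    current = frozenset(execute_query(search.query))
    new_ids = current - search.last_seen_ids
    search.last_seen_ids = current
    search.last_run = datetime.utcnow()
    return sorted(new_ids)  # the notification payload for the user
```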