Package datatap

This module provides classes and methods for interacting with dataTap. This includes inspecting individual annotations, creating or importing new annotations, and creating or loading datasets for machine learning.

dataTap is the visual data management platform from Zensors.


Join for free at app.datatap.dev.

The dataTap Python library is the primary interface for using dataTap's rich data management tools. Create datasets, stream annotations, and analyze model performance all with one library.


Documentation

Full documentation is available at docs.datatap.dev.

Features

  • ⚡ Begin training instantly
  • 🔥 Works with all major ML frameworks (PyTorch, TensorFlow, etc.)
  • 🛰️ Real-time streaming to avoid large dataset downloads
  • 🌐 Universal data format for simple data exchange
  • 🎨 Combine data from multiple sources into a single dataset easily
  • 🧮 Rich ML utilities to compute PR curves, confusion matrices, and accuracy metrics
  • 💽 Free access to a variety of open datasets

Getting Started (Platform)

To begin, select a dataset from the dataTap repository.

Then copy the starter code based on your library preference.

Paste the starter code and start training.

Getting Started (API)

Install the client library.

pip install datatap

Register at app.datatap.dev. Then, go to Settings > Api Keys to find your personal API key.

export DATATAP_API_KEY="XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXX"

Start using open datasets instantly.

from datatap import Api

api = Api()
coco = api.get_default_database().get_repository("_/coco")
dataset = coco.get_dataset("latest")
print("COCO: ", dataset)

Data Streaming Example

import itertools
from datatap import Api

api = Api()
dataset = (api
    .get_default_database()
    .get_repository("_/wider-person")
    .get_dataset("latest")
)

training_stream = dataset.stream_split("training")
for annotation in itertools.islice(training_stream, 5):
    print("Received annotation:", annotation)
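The streaming example relies on `itertools.islice` consuming the stream lazily, so only the first five annotations are ever requested. The sketch below illustrates that behaviour with a stand-in generator. `fake_annotation_stream` is hypothetical and does not touch the dataTap API; it only assumes that the real stream behaves like an ordinary Python iterable, as the loop above suggests.

```python
import itertools
from typing import Iterator, List

pulled: List[int] = []  # records which items the consumer actually requested

def fake_annotation_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streamed split; yields items one at a time."""
    for index in itertools.count():
        pulled.append(index)
        yield f"annotation-{index}"

# islice stops asking for items once it has five of them,
# even though the underlying generator is unbounded.
first_five = list(itertools.islice(fake_annotation_stream(), 5))
print(first_five)
print("items pulled from the stream:", len(pulled))
```

Because the slice is lazy, the same pattern is a cheap way to spot-check a large split before committing to a full training run.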

More Examples

Support and FAQ

Q. How do I resolve a missing API Key?

If you see the error Exception: No API key available. Either provide it or use the [DATATAP_API_KEY] environment variable, then the dataTap library was not able to find your API key. You can find your API key on app.datatap.dev under settings. You can either set it as an environment variable or as the first argument to the Api constructor.
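The lookup order described above (explicit argument first, then the `DATATAP_API_KEY` environment variable) can be sketched as follows. `resolve_api_key` is a hypothetical helper written for illustration, not the library's own implementation, and the key values are placeholders.

```python
import os
from typing import Optional

def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to the environment."""
    key = explicit_key or os.environ.get("DATATAP_API_KEY")
    if not key:
        # Mirrors the wording of the error quoted in this FAQ entry.
        raise Exception(
            "No API key available. Either provide it or use the "
            "[DATATAP_API_KEY] environment variable"
        )
    return key

os.environ["DATATAP_API_KEY"] = "key-from-environment"  # placeholder value
print(resolve_api_key())                # falls back to the environment variable
print(resolve_api_key("key-from-arg"))  # an explicit argument takes precedence
```

With the real library, the equivalent of the explicit path is passing the key as the first argument, `Api(api_key)`.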

Q. Can dataTap be used offline?

Some functionality can be used offline, such as the droplet utilities and metrics. However, repository access and dataset streaming require internet access, even for local databases.

Q. Is dataTap accepting contributions?

dataTap currently uses a separate code review system for managing contributions. The team is looking into switching that system to GitHub to allow public contributions. Until then, we will actively monitor the GitHub issue tracker to help accommodate the community's needs.

Q. How can I get help using dataTap?

You can post a question in the issue tracker. The dataTap team actively monitors the repository, and will try to get back to you as soon as possible.

"""
This module provides classes and methods for interacting with dataTap.  This includes inspecting individual annotations,
creating or importing new annotations, and creating or loading datasets for machine learning.

.. include:: ../README.md
"""

import sys as _sys

if _sys.version_info < (3, 7):
    print("\x1b[38;5;1mUsing an unsupported python version. Please install Python 3.7 or greater\x1b[0m")
    raise Exception("Invalid python version")

from .api.entities import Api

__all__ = [
    "Api",
    "api",
    "droplet",
    "geometry",
    "template",
    "utils",
]

Sub-modules

datatap.api

The datatap.api module provides two different interfaces for the API …

datatap.comet
datatap.droplet

This module provides classes for working with ML data. Specifically, it provides methods for creating new ML data objects, converting ML data objects …

datatap.examples

Example code

datatap.geometry

This module provides geometric primitives for storing or manipulating ML annotations …

datatap.metrics

The metrics module provides a number of utilities for analyzing droplets in the context of a broader training or evaluation job …

datatap.template

Templates are used to describe how a given annotation (or set of annotations) is structured …

datatap.tf

The datatap.tf module provides utilities for using dataTap with TensorFlow …

datatap.torch

The datatap.torch module provides utilities for using dataTap with PyTorch …

datatap.utils

A collection of primarily internal-use utilities.

Classes

class Api (api_key: Optional[str] = None, uri: Optional[str] = None)

The Api object is the primary method of interacting with the dataTap API.

The Api constructor takes two optional arguments.

The first, api_key, should be the current user's personal API key. In order to encourage good secret practices, this class will use the value found in the DATATAP_API_KEY environment variable if no key is passed in. Consider using environment variables or another secret manager for your API keys.

The second argument is uri. This should only be used if you would like to target a different API server than the default. For instance, if you are using a proxy to reach the API, you can use the uri argument to point toward your proxy.

This object encapsulates most of the logic for interacting with the API. For instance, to get a list of all datasets that a user has access to, you can run

from datatap import Api

api = Api()
print([
    dataset
    for database in api.get_database_list()
    for dataset in database.get_dataset_list()
])

For more details on the functionality provided by the Api object, take a look at its documentation.

class Api:
    """
    The `Api` object is the primary method of interacting with the dataTap API.

    The `Api` constructor takes two optional arguments.

    The first, `api_key`, should be the current user's personal API key. In
    order to encourage good secret practices, this class will use the value
    found in the `DATATAP_API_KEY` environment variable if no key is passed
    in. Consider using environment variables or another secret manager for
    your API keys.

    The second argument is `uri`. This should only be used if you would like
    to target a different API server than the default. For instance, if you
    are using a proxy to reach the API, you can use the `uri` argument to
    point toward your proxy.

    This object encapsulates most of the logic for interacting with the API.
    For instance, to get a list of all datasets that a user has access to,
    you can run

    ```py
    from datatap import Api

    api = Api()
    print([
        dataset
        for database in api.get_database_list()
        for dataset in database.get_dataset_list()
    ])
    ```

    For more details on the functionality provided by the Api object, take
    a look at its documentation.
    """
    def __init__(self, api_key: Optional[str] = None, uri: Optional[str] = None):
        self.endpoints = ApiEndpoints(api_key, uri)

    def get_current_user(self) -> User:
        """
        Returns the current logged-in user.
        """
        return User.from_json(self.endpoints, self.endpoints.user.current())

    def get_database_list(self) -> List[Database]:
        """
        Returns a list of all databases that the current user has access to.
        """
        return [
            Database.from_json(self.endpoints, json_db)
            for json_db in self.endpoints.database.list()
        ]

    def get_default_database(self) -> Database:
        """
        Returns the default database for the user (this defaults to the public
        database).
        """

        # TODO(zwade): Have a way of specifying a per-user default
        current_user = self.get_current_user()
        if current_user.default_database is None:
            raise Exception("Trying to find the default database, but none is specified")

        return self.get_database_by_uid(current_user.default_database)

    def get_database_by_uid(self, uid: str) -> Database:
        """
        Queries a database by its UID and returns it.
        """
        return Database.from_json(self.endpoints, self.endpoints.database.query_by_uid(uid))


    @overload
    def get_database_by_name(self, name: str, allow_multiple: Literal[True]) -> List[Database]: ...
    @overload
    def get_database_by_name(self, name: str, allow_multiple: Literal[False] = False) -> Database: ...
    def get_database_by_name(self, name: str, allow_multiple: bool = False) -> Union[Database, List[Database]]:
        """
        Queries a database by its name and returns it. If `allow_multiple` is true, it will return
        a list of databases.
        """
        database_list = [
            Database.from_json(self.endpoints, database)
            for database in self.endpoints.database.query_by_name(name)
        ]

        if allow_multiple:
            return database_list
        else:
            return assert_one(database_list)

Methods

def get_current_user(self) ‑> User

Returns the current logged-in user.

def get_current_user(self) -> User:
    """
    Returns the current logged-in user.
    """
    return User.from_json(self.endpoints, self.endpoints.user.current())

def get_database_by_name(self, name: str, allow_multiple: bool = False) ‑> Union[Database, List[Database]]

Queries a database by its name and returns it. If allow_multiple is true, it will return a list of databases.
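When `allow_multiple` is false, the result list is passed through `assert_one`, whose implementation is not shown on this page. The sketch below illustrates the contract that name suggests: return the only element, otherwise fail. `expect_single` is a hypothetical stand-in, and the exception type and message are assumptions; the real helper may behave differently.

```python
from typing import List, TypeVar

T = TypeVar("T")

def expect_single(items: List[T]) -> T:
    """Sketch of an assumed contract: return the only element or raise."""
    if len(items) != 1:
        # Assumed failure mode; the library's helper may raise something else.
        raise ValueError(f"Expected exactly one result, found {len(items)}")
    return items[0]

print(expect_single(["public"]))
```

If several databases may share a name, passing `allow_multiple=True` returns the full list so the caller can choose among them.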

def get_database_by_name(self, name: str, allow_multiple: bool = False) -> Union[Database, List[Database]]:
    """
    Queries a database by its name and returns it. If `allow_multiple` is true, it will return
    a list of databases.
    """
    database_list = [
        Database.from_json(self.endpoints, database)
        for database in self.endpoints.database.query_by_name(name)
    ]

    if allow_multiple:
        return database_list
    else:
        return assert_one(database_list)

def get_database_by_uid(self, uid: str) ‑> Database

Queries a database by its UID and returns it.

def get_database_by_uid(self, uid: str) -> Database:
    """
    Queries a database by its UID and returns it.
    """
    return Database.from_json(self.endpoints, self.endpoints.database.query_by_uid(uid))

def get_database_list(self) ‑> List[Database]

Returns a list of all databases that the current user has access to.

def get_database_list(self) -> List[Database]:
    """
    Returns a list of all databases that the current user has access to.
    """
    return [
        Database.from_json(self.endpoints, json_db)
        for json_db in self.endpoints.database.list()
    ]

def get_default_database(self) ‑> Database

Returns the default database for the user (this defaults to the public database).
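This method raises a plain `Exception` when the user has no default database. One way a caller might cope is to fall back to the first database returned by `get_database_list`. The sketch below assumes only the two method names documented on this page; `default_or_first_database` is a hypothetical helper, and `api` is typed loosely so the snippet does not depend on the library being installed.

```python
from typing import Any

def default_or_first_database(api: Any) -> Any:
    """Hypothetical helper: use the default database, else the first accessible one."""
    try:
        return api.get_default_database()
    except Exception:
        # get_default_database raises a bare Exception, so the except clause
        # is broad; note that it will also catch network failures.
        databases = api.get_database_list()
        if not databases:
            raise  # nothing to fall back to, so surface the original error
        return databases[0]
```

A call would look like `database = default_or_first_database(Api())`. Picking the first entry is an arbitrary choice made for brevity; selecting by name with `get_database_by_name` is likely a better fit when the target database is known.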

def get_default_database(self) -> Database:
    """
    Returns the default database for the user (this defaults to the public
    database).
    """

    # TODO(zwade): Have a way of specifying a per-user default
    current_user = self.get_current_user()
    if current_user.default_database is None:
        raise Exception("Trying to find the default database, but none is specified")

    return self.get_database_by_uid(current_user.default_database)