Package `infinidata`

Our top level module

from .infinidata import *

__doc__ = infinidata.__doc__
if hasattr(infinidata, "__all__"):
    __all__ = infinidata.__all__

Sub-modules

infinidata.infinidata: Our top level module

Classes

class TableView (dict)

A view into a table. A table is a collection of columns, each of which has a name along with a dtype and a shape for the entries. E.g.:

tbl_dict = {
  "foo": np.arange(45*16*2, dtype=np.float32).reshape((45,16,2)),
  "bar": np.arange(45, dtype=np.int64),
  "baz": np.array(["hello"] * 45)
}
tbl = infinidata.TableView(tbl_dict)

Here, tbl_dict is a dictionary mapping column names to NumPy arrays, and we use it to construct a TableView. The columns are "foo", "bar", and "baz", with dtypes float32, int64, and string, and per-row shapes (16, 2), (), and () respectively. The table has 45 rows.

The rows of a TableView can be accessed by subscripting with []: tbl[5] gets row 5, tbl[2:5] gets rows 2, 3, and 4, and tbl[np.array([1, 3, 5])] gets rows 1, 3, and 5. You can also use a slice with a step size: tbl[1:10:2] gets rows 1, 3, 5, 7, and 9. Fetching a range releases the GIL temporarily; the other subscripting methods do not.
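The indexing forms described above follow ordinary NumPy row indexing. The sketch below does not call infinidata; it applies the same indices to the plain NumPy columns to show which rows each form selects. The helper name take_rows is hypothetical, and the exact types a TableView returns for a single row are not covered here.

```python
import numpy as np

# Same columns as the example table above, held as plain NumPy arrays.
tbl_dict = {
    "foo": np.arange(45 * 16 * 2, dtype=np.float32).reshape((45, 16, 2)),
    "bar": np.arange(45, dtype=np.int64),
    "baz": np.array(["hello"] * 45),
}


def take_rows(columns, index):
    """Hypothetical helper: apply one row index to every column."""
    return {name: values[index] for name, values in columns.items()}


row = take_rows(tbl_dict, 5)                            # row 5
rows_range = take_rows(tbl_dict, slice(2, 5))           # rows 2, 3, 4
rows_picked = take_rows(tbl_dict, np.array([1, 3, 5]))  # rows 1, 3, 5
rows_stepped = take_rows(tbl_dict, slice(1, 10, 2))     # rows 1, 3, 5, 7, 9

print(row["foo"].shape)        # shape of one "foo" entry
print(rows_range["bar"])       # "bar" values for the selected range
print(rows_stepped["bar"])     # "bar" values for the stepped slice
```

Because "bar" holds np.arange(45), its values double as row numbers, which makes it easy to see which rows were selected.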

Static methods

def batch_iter_concat(views, batch_size, drop_last_batch=False, threads=1, readahead=0): Iterate over the rows of a list of TableViews in order, without creating a new view.
def concat(views): Concatenate multiple TableViews together.
def load_from_disk(dir, filename): Load a TableView from disk. Provide a directory and a name, and the TableView along with its dependencies will be mapped. THE INFINIDATA DISK FORMAT IS NOT STABLE.

Methods

def batch_iter(self, /, batch_size, drop_last_batch=False, threads=1, readahead=0)

Iterate over the rows of a TableView in batches. The threads and readahead parameters can speed up loading at the expense of memory usage.

Parameters:

batch_size: The number of rows in each batch
drop_last_batch: If true, the last batch will be dropped if it's smaller than batch_size
threads: The number of threads to use for parallel loading. Must be at least 1, and less than or equal to readahead, unless readahead is 0.
readahead: The maximum number of batches to load ahead of time. Setting this to at least 1 makes data loading asynchronous with the Python code that consumes the iterator: batches start loading as soon as the iterator is created, and keep loading as long as there is space in the readahead buffer.
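To make the batch_size, drop_last_batch, threads, and readahead rules concrete, here is a minimal pure-Python sketch of the arithmetic they imply. It does not use infinidata and is not its implementation; batch_bounds and check_threads are hypothetical helpers written only to illustrate the documented behavior.

```python
def batch_bounds(n_rows, batch_size, drop_last_batch=False):
    """Hypothetical helper: yield (start, stop) row ranges, one per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # A short batch can only be the final one; skip it if requested.
        if drop_last_batch and stop - start < batch_size:
            return
        yield (start, stop)


def check_threads(threads, readahead):
    """Hypothetical helper: the documented constraint on threads and readahead."""
    return threads >= 1 and (readahead == 0 or threads <= readahead)


print(list(batch_bounds(45, 16)))
print(list(batch_bounds(45, 16, drop_last_batch=True)))
print(check_threads(threads=4, readahead=2))
```

With 45 rows and a batch size of 16, the final batch holds only 13 rows, so drop_last_batch=True leaves two full batches.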

def new_view(self, /, mapping)

Make a new TableView from an existing one, remapping the indices either using an index array or a range. E.g.:

tv = infinidata.TableView({"foo": np.arange(45, dtype=np.int64)})
tv2 = tv.new_view(np.array([1, 3, 5]))
tv3 = tv.new_view(slice(1, 10, 2))
tv4 = tv.new_view(slice(None, None, -1))

tv2 is a new view containing rows 1, 3, and 5 of tv. The slice function is equivalent to the [start:stop:step] notation used when subscripting, so tv3 is a new view containing rows 1, 3, 5, 7, and 9 of tv. tv4 is a new view with the rows of tv in reverse order.
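The three mappings can be checked against a plain NumPy array standing in for the "foo" column. This is only a sketch of the index remapping, not of how infinidata stores views; the remap helper is hypothetical, and the final line shows how NumPy composes two mappings, which is assumed rather than documented to match chained new_view calls.

```python
import numpy as np

# Stand-in for the "foo" column: the value at each position equals its row number.
base = np.arange(45, dtype=np.int64)


def remap(column, mapping):
    """Hypothetical helper: select rows of a column by index array or slice."""
    return column[mapping]


v2 = remap(base, np.array([1, 3, 5]))    # rows 1, 3, 5
v3 = remap(base, slice(1, 10, 2))        # rows 1, 3, 5, 7, 9
v4 = remap(base, slice(None, None, -1))  # all rows, reversed

# Remapping a remapped column composes the two mappings.
v5 = remap(v4, slice(0, 3))              # first three rows of the reversed order

print(v2, v3, v5)
```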

def remove_matching_strings(self, /, column, strings_to_remove)

Remove rows from the table where a given string column matches an element of a given set. This materializes the full set of retained indices in memory, so it's not suitable for obscenely large numbers of rows. An offline approach is possible, but not implemented.
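The memory cost mentioned above comes from building the full list of retained row indices. A minimal pure-Python sketch of that idea follows, using a list of strings in place of a string column. The retained_indices helper is hypothetical and is not infinidata's implementation.

```python
def retained_indices(column, strings_to_remove):
    """Hypothetical helper: indices of rows whose value is not in the removal set."""
    remove = set(strings_to_remove)  # set lookup keeps each membership test cheap
    return [i for i, value in enumerate(column) if value not in remove]


names = ["cat", "dog", "bird", "dog", "fish"]
keep = retained_indices(names, {"dog", "fish"})
kept = [names[i] for i in keep]

print(keep)
print(kept)
```

The keep list grows with the number of retained rows, which is why the approach becomes impractical for extremely large tables.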

def save_to_disk(self, /, dir, filename=None)

Save a TableView to disk. The TableView along with all its dependencies will be hardlinked (or copied if the original backing storage is on a different fs) in the destination directory. This is smart enough to avoid duplicating data if you use the same directory multiple times, so if you save two TableViews that share a backing storage, the storage will only be saved once. Of course, if they're on the same fs then hardlinking prevents duplication anyway.

THE INFINIDATA DISK FORMAT IS NOT STABLE. This function exists for caching, not permanent storage. If you want to save data permanently, use a different format.

Parameters:

dir: The directory to save the TableView and its dependencies in. Directories can be reused across different TableViews; doing so prevents redundant copies.
filename: The name of the file to save the TableView in. If a name isn't provided you'll just get a bunch of files named by UUID and you'll have a hard time finding what you're looking for.
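The "hardlink, or copy if on a different filesystem" behavior described above is a common pattern that can be sketched with the standard library. The link_or_copy helper below is hypothetical and makes two assumptions not stated in the documentation: that a failed os.link is the signal to fall back to copying, and that an existing destination file means the storage was already saved.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Hypothetical helper: hardlink src to dst, copying if linking fails."""
    if os.path.exists(dst):
        return "exists"  # assumed: already saved in this directory
    try:
        os.link(src, dst)  # cheap, no data duplicated on the same filesystem
        return "linked"
    except OSError:
        shutil.copy2(src, dst)  # e.g. source and destination on different filesystems
        return "copied"


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "storage.bin")
    with open(src, "wb") as f:
        f.write(b"column data")

    dest_dir = os.path.join(tmp, "cache")
    os.makedirs(dest_dir)
    dst = os.path.join(dest_dir, "storage.bin")

    first = link_or_copy(src, dst)   # links (or copies) the file
    second = link_or_copy(src, dst)  # second save into the same directory is a no-op
    with open(dst, "rb") as f:
        saved = f.read()

print(first, second)
```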

def select_columns(self, /, columns)

Select the columns to be viewed, returning a new TableView. Subscripting the TableView and using batch_iter will return dicts with only those columns. This can improve performance by reading less data if you're only using some of the columns.

def shuffle(self, /, seed=None)

Shuffle the rows of a TableView. N.b. this will use enough memory to make a complete index array. An offline approach is possible, and may be necessary if you have an absurd number of rows, but it is not implemented yet. The memory is freed after the shuffle is complete: the generated index array is stored on disk.
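Shuffling by index array means permuting the row numbers rather than moving any row data. The sketch below shows the idea with the standard library's random module as a stand-in. The random number generator and seeding scheme infinidata uses are not documented here, shuffled_indices is a hypothetical helper, and unlike the library this version keeps the index array in memory.

```python
import random


def shuffled_indices(n_rows, seed=None):
    """Hypothetical helper: a permutation of range(n_rows), repeatable for a given seed."""
    indices = list(range(n_rows))  # the complete index array mentioned above
    random.Random(seed).shuffle(indices)
    return indices


order_a = shuffled_indices(45, seed=0)
order_b = shuffled_indices(45, seed=0)

print(order_a == order_b)                  # same seed, same order
print(sorted(order_a) == list(range(45)))  # every row appears exactly once
```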

def uuid(self, /)

Get the UUID of the TableView