Package infinidata
Our top level module
from .infinidata import *
__doc__ = infinidata.__doc__
if hasattr(infinidata, "__all__"):
    __all__ = infinidata.__all__
Sub-modules
infinidata.infinidata
-
Our top level module
Classes
class TableView (dict)
-
A view into a table. A table is a collection of columns, each of which has a name along with a dtype and a shape for the entries. E.g.:
tbl_dict = {
    "foo": np.arange(45*16*2, dtype=np.float32).reshape((45,16,2)),
    "bar": np.arange(45, dtype=np.int64),
    "baz": np.array(["hello"] * 45)
}
tbl = infinidata.TableView(tbl_dict)
Here, tbl_dict is a dictionary mapping column names to NumPy arrays. We use it to construct a TableView. The columns are "foo", "bar", and "baz", with dtypes float32, int64, and string, and shapes (16, 2), (), and () respectively. The table has 45 rows.
The rows of a TableView tv can be accessed by subscripting with []: tv[5] gets row 5, tv[2:5] gets rows 2, 3, and 4, and tv[np.array([1, 3, 5])] gets rows 1, 3, and 5. You can also use a slice with a step size: tv[1:10:2] gets rows 1, 3, 5, 7, and 9. Fetching a range releases the GIL temporarily; the other subscripting methods do not.
Static methods
def batch_iter_concat(views, batch_size, drop_last_batch=False, threads=1, readahead=0)
-
Iterate over the rows of a list of TableViews in order, without creating a new view.
def concat(views)
-
Concatenate multiple TableViews together.
def load_from_disk(dir, filename)
-
Load a TableView from disk. Provide a directory and a name, and the TableView along with its dependencies will be mapped. THE INFINIDATA DISK FORMAT IS NOT STABLE.
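The idea behind batch_iter_concat, iterating over several sources in order without first building a combined one, can be sketched with plain Python lists standing in for TableViews. This is only an illustration, not infinidata's implementation: the helper name batch_iter_over is hypothetical, and whether infinidata's batches span the boundary between two views (as they do here) is an assumption.

```python
from itertools import chain, islice

def batch_iter_over(sequences, batch_size, drop_last_batch=False):
    """Yield lists of up to batch_size rows drawn from several sequences in order."""
    # chain.from_iterable is lazy, so no concatenated copy is ever built.
    rows = chain.from_iterable(sequences)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        if len(batch) < batch_size and drop_last_batch:
            return
        yield batch

# Plain lists standing in for TableViews (assumption for illustration).
views = [[0, 1, 2], [3, 4], [5, 6, 7]]
batches = list(batch_iter_over(views, batch_size=3))
```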
Methods
def batch_iter(self, /, batch_size, drop_last_batch=False, threads=1, readahead=0)
-
Iterate over the rows of a TableView in batches. The threads and readahead parameters can speed up loading at the expense of memory usage.
Parameters:
batch_size: The number of rows in each batch.
drop_last_batch: If true, the last batch will be dropped if it's smaller than batch_size.
threads: The number of threads to use for parallel loading. Must be at least 1, and less than or equal to readahead, unless readahead is 0.
readahead: The maximum number of batches to load ahead of time. Setting this to at least 1 will make data loading asynchronous with your Python code that is consuming the iterator: batches will start being loaded as soon as the iterator is created, and will continue being loaded so long as there is space in the readahead buffer.
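The parameter rules above can be sketched in plain Python. The helpers check_threads_readahead, make_batches, and readahead_iter are hypothetical names, a list stands in for a TableView, and the bounded queue with a single producer thread is one common way to get readahead behaviour, not necessarily how infinidata does it. The sketch only covers readahead of at least 1.

```python
import queue
import threading

def check_threads_readahead(threads, readahead):
    """Mirror the documented constraint between threads and readahead."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if readahead != 0 and threads > readahead:
        raise ValueError("threads must be <= readahead unless readahead is 0")

def make_batches(rows, batch_size, drop_last_batch=False):
    """Yield consecutive batches of rows from a sequence."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if len(batch) < batch_size and drop_last_batch:
            return
        yield batch

_DONE = object()  # sentinel marking the end of the stream

def readahead_iter(batches, readahead):
    """Load batches on a background thread; the buffer holds at most `readahead`."""
    if readahead < 1:
        raise ValueError("this sketch only covers readahead >= 1")
    buf = queue.Queue(maxsize=readahead)

    def producer():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(_DONE)

    # The thread starts here, when the iterator is created, not on first use.
    threading.Thread(target=producer, daemon=True).start()

    def consume():
        while True:
            item = buf.get()
            if item is _DONE:
                return
            yield item

    return consume()

rows = list(range(10))  # a list standing in for a TableView
check_threads_readahead(threads=1, readahead=2)
result = list(readahead_iter(make_batches(rows, 4), readahead=2))
```

A larger readahead lets the producer run further ahead of the consumer, which is where the extra memory usage comes from.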
def new_view(self, /, mapping)
-
Make a new TableView from an existing one, remapping the indices either using an index array or a range. E.g.:
tv = infinidata.TableView({"foo": np.arange(45, dtype=np.int64)})
tv2 = tv.new_view(np.array([1, 3, 5]))
tv3 = tv.new_view(slice(1, 10, 2))
tv4 = tv.new_view(slice(None, None, -1))
tv2 is a new view containing rows 1, 3, and 5 of tv. The slice function is equivalent to the [start:stop:step] notation used when subscripting. tv3 is a new view containing rows 1, 3, 5, 7, and 9 of tv. tv4 is a new view with the rows of tv in reverse order.
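Index remapping of this kind can be illustrated with a small stand-in class. ListView below is hypothetical and is not part of infinidata: it keeps a reference to the underlying rows plus a sequence of indices, so creating a view never copies the data. How infinidata stores its mappings internally is not covered by this documentation.

```python
class ListView:
    """A toy view: a base sequence plus the indices that are visible through it."""

    def __init__(self, base, indices=None):
        self.base = base
        self.indices = range(len(base)) if indices is None else indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        return self.base[self.indices[i]]

    def new_view(self, mapping):
        if isinstance(mapping, slice):
            # Both range and list support slicing, including negative steps.
            picked = self.indices[mapping]
        else:
            picked = [self.indices[i] for i in mapping]
        return ListView(self.base, picked)

    def rows(self):
        return [self.base[i] for i in self.indices]

tv = ListView(list(range(45)))
tv2 = tv.new_view([1, 3, 5])
tv3 = tv.new_view(slice(1, 10, 2))
tv4 = tv.new_view(slice(None, None, -1))
tv5 = tv4.new_view([0, 1])  # views compose: the first two rows of the reversed view
```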
def remove_matching_strings(self, /, column, strings_to_remove)
-
Remove rows from the table where a given string column matches an element of a given set. This materializes the full set of retained indices in memory, so it's not suitable for obscenely large numbers of rows. An offline approach is possible, but not implemented.
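The memory cost mentioned above comes from holding one index per retained row. A rough sketch of that idea, using a plain list as the string column and a hypothetical helper retained_indices (not an infinidata function):

```python
def retained_indices(column, strings_to_remove):
    """Return the indices of rows whose value is not in strings_to_remove."""
    to_remove = set(strings_to_remove)  # set membership tests are O(1) on average
    # The whole list of kept indices lives in memory at once.
    return [i for i, value in enumerate(column) if value not in to_remove]

column = ["cat", "dog", "bird", "dog", "fish"]
keep = retained_indices(column, {"dog", "fish"})
filtered = [column[i] for i in keep]
```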
def save_to_disk(self, /, dir, filename=None)
-
Save a TableView to disk. The TableView along with all its dependencies will be hardlinked (or copied if the original backing storage is on a different fs) in the destination directory. This is smart enough to avoid duplicating data if you use the same directory multiple times, so if you save two TableViews that share a backing storage, the storage will only be saved once. Of course, if they're on the same fs then hardlinking prevents duplication anyway.
THE INFINIDATA DISK FORMAT IS NOT STABLE. This function exists for caching, not permanent storage. If you want to save data permanently, use a different format.
Parameters:
dir: The directory to save the TableView and its dependencies in. Directories can be reused across different TableViews; doing this will prevent redundant copies.
filename: The name of the file to save the TableView in. If a name isn't provided you'll just get a bunch of files named by UUID and you'll have a hard time finding what you're looking for.
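The hardlink-or-copy behaviour described above follows a common pattern that can be sketched with the standard library. The helper link_or_copy is hypothetical, and naming the destination after the source file's basename is an assumption made for the example; infinidata's actual file layout is not documented here.

```python
import os
import shutil
import tempfile

def link_or_copy(src, dest_dir):
    """Place src in dest_dir by hard link, falling back to a copy."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    if os.path.exists(dest):
        return dest  # already saved in this directory, so nothing is duplicated
    try:
        os.link(src, dest)  # fails with OSError across filesystems
    except OSError:
        shutil.copy2(src, dest)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "storage.bin")
    with open(src, "wb") as f:
        f.write(b"column data")
    cache_dir = os.path.join(tmp, "cache")
    first = link_or_copy(src, cache_dir)
    second = link_or_copy(src, cache_dir)  # reuses the file saved by the first call
    with open(first, "rb") as f:
        saved = f.read()
```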
def select_columns(self, /, columns)
def shuffle(self, /, seed=None)
-
Shuffle the rows of a TableView. N.b. this will use enough memory to make a complete index array. An offline approach is possible and maybe necessary if you have an absurd number of rows, but not implemented yet. The memory is freed after the shuffle is complete - the generated index array is stored on disk.
def uuid(self, /)
-
Get the UUID of the TableView.
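For the shuffle method described earlier, the index-array approach can be sketched as follows. The helper shuffled_indices is hypothetical, and Python's random module is used purely for illustration: which random number generator infinidata uses, and how it interprets seed, are not stated in this documentation.

```python
import random

def shuffled_indices(num_rows, seed=None):
    """Build a complete index array in memory and shuffle it."""
    indices = list(range(num_rows))  # one entry per row, all held in memory
    random.Random(seed).shuffle(indices)
    return indices

rows = ["a", "b", "c", "d", "e"]
order = shuffled_indices(len(rows), seed=0)
shuffled = [rows[i] for i in order]  # the rows themselves are never moved
```

In this sketch, passing the same seed reproduces the same order.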