Python Multi-dimensional Array Libraries
----------------------------------------
Python has a rich scientific computing library ecosystem.
The ecosystem's linchpin is `Numpy `_,
which provides accelerated C and Fortran operations on multi-dimensional
arrays.
Other libraries such as `Scipy `_ and
`Astropy `_ build on top of Numpy.
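
As a rough illustration of what "accelerated operations on multi-dimensional arrays" means in practice, the sketch below applies per-channel weights to a three-dimensional array without writing any Python loops. The array shapes and the names ``vis`` and ``weights`` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical (row, chan, corr) cube filled with ones.
vis = np.ones((10, 16, 4), dtype=np.complex64)
# Hypothetical per-channel weights.
weights = np.linspace(0.0, 1.0, 16, dtype=np.float32)

# Broadcasting applies the weights across rows and correlations
# in a single vectorised expression executed in compiled code.
weighted = vis * weights[None, :, None]

# Reduce over the channel axis.
chan_sum = weighted.sum(axis=1)

print(weighted.shape)  # (10, 16, 4)
print(chan_sum.shape)  # (10, 4)
```
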
CASA Support for NumPy
~~~~~~~~~~~~~~~~~~~~~~
CASA supports reading and writing Table data as
Numpy arrays via the `python-casacore
`_ library.

.. testcode::

    import casacore.tables as pt
    from daskms.example_data import example_ms

    ms_filename = example_ms()

    with pt.table(ms_filename) as T:
        ddid = T.getcol("DATA_DESC_ID")
        print(ddid)
        print(type(ddid))

This produces the following output:

.. testoutput::

    Successful readonly open of default-locked table /tmp/tmp7wkejl07.ms: 22 columns, 10 rows
    [0 0 0 0 1 1 1 1 1 1]
    <class 'numpy.ndarray'>

Specific row ranges can be requested:

.. testcode::

    with pt.table(ms_filename) as T:
        print(T.getcol("DATA_DESC_ID", startrow=2, nrow=4))

.. testoutput::

    [0 0 1 1]

If we wish to access variably shaped data, such as can be present
in the DATA column, `getcol` cannot (simply) be used,
because it is not possible to return a single, fixed-shape
NumPy array representing all of this data.
Instead, we must make a variably shaped data request via `getvarcol`:

.. testcode::

    from pprint import pprint

    with pt.table(ms_filename) as T:
        data = T.getvarcol("DATA")

    pprint({k: v.shape for k, v in data.items()})

This produces a dictionary containing one variably shaped NumPy array
for each row, rather than the single array produced by `getcol`:

.. testoutput::

    {'r1': (1, 16, 4),
     'r2': (1, 16, 4),
     'r3': (1, 16, 4),
     'r4': (1, 16, 4),
     'r5': (1, 32, 2),
     'r6': (1, 32, 2),
     'r7': (1, 32, 2),
     'r8': (1, 32, 2),
     'r9': (1, 32, 2),
     'r10': (1, 32, 2)}

However, if we know that the first four rows (DATA_DESC_ID = 0) share one
shape and the last six rows (DATA_DESC_ID = 1) share another, we can request
each range of rows with `getcol`:

.. testcode::

    with pt.table(ms_filename) as T:
        # DATA_DESC_ID = 0 (4 rows, 16 channels, 4 correlations)
        print(T.getcol("DATA", startrow=0, nrow=4).shape)
        # DATA_DESC_ID = 1 (6 rows, 32 channels, 2 correlations)
        print(T.getcol("DATA", startrow=4, nrow=6).shape)

.. testoutput::

    (4, 16, 4)
    (6, 32, 2)

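
The row ranges used above were chosen by inspection. As a minimal sketch, and assuming that rows with the same DATA_DESC_ID are contiguous and share a shape (as they are in this example), the `startrow` and `nrow` values could be derived from the DATA_DESC_ID column with NumPy alone. The ``ddid`` array below is typed in by hand to match the earlier output rather than read from a table.

```python
import numpy as np

# DATA_DESC_ID values as printed by the earlier getcol example.
ddid = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# A new run starts at row 0 and wherever the value changes.
starts = np.concatenate(([0], np.flatnonzero(np.diff(ddid)) + 1))
# Each run ends where the next one starts; the last ends at the final row.
counts = np.diff(np.concatenate((starts, [ddid.size])))

# (DATA_DESC_ID, startrow, nrow) triples.
runs = [(int(ddid[s]), int(s), int(n)) for s, n in zip(starts, counts)]
print(runs)  # [(0, 0, 4), (1, 4, 6)]
```

Each triple supplies the `startrow` and `nrow` arguments for one fixed-shape `getcol` request. If rows with the same DATA_DESC_ID were interleaved, this approach would yield many short runs instead.
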
Consult the `python-casacore
`_ library for further
information.
Dask
~~~~
`dask `_ is a general
Python parallel programming framework that can distribute work
over multiple cores and nodes. The
`dask Array API `_
provides an interface that mimics that of NumPy, while conceptually
dividing the underlying data into chunks on which operations are
executed in parallel.
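
To make the idea of chunking concrete, here is a small sketch that does not involve any CASA Table. The array shape and the chunk size of four rows are assumptions chosen to echo the example data.

```python
import dask.array as da
import numpy as np

# Hypothetical (row, chan, corr) array split into chunks of 4 rows.
data = da.ones((10, 16, 4), chunks=(4, 16, 4), dtype=np.complex64)
print(data.chunks)  # ((4, 4, 2), (16,), (4,))

# NumPy-style expression; this only builds a task graph.
mean_amp = abs(data).mean(axis=1)

# compute() executes the graph, one task per chunk where possible.
result = mean_amp.compute()
print(result.shape)  # (10, 4)
```

Nothing is evaluated until ``compute()`` is called, which is what lets dask schedule the per-chunk work across cores or nodes.
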
The purpose of dask-ms is to expose CASA Table column data to
the user as dask arrays, in order to facilitate parallel programming
of radio astronomy algorithms.
Xarray
~~~~~~
`xarray `_ groups logically related
NumPy and dask arrays into Datasets. Dimensions shared by multiple
arrays are related to each other by name, enabling rich data science
applications.
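
A minimal sketch of this relationship, independent of any Measurement Set, is shown below. The variable names and shapes are assumptions made for illustration; both variables share the ``row`` dimension, so a selection along ``row`` applies to both.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: TIME and DATA share the "row" dimension.
ds = xr.Dataset({
    "TIME": (("row",), np.arange(4, dtype=np.float64)),
    "DATA": (("row", "chan", "corr"),
             np.ones((4, 16, 4), dtype=np.complex64)),
})

# Selecting along "row" slices every variable that has that dimension.
subset = ds.isel(row=slice(0, 2))
print(subset.TIME.shape)  # (2,)
print(subset.DATA.shape)  # (2, 16, 4)
```
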
For example, using our example Measurement Set we can do the following:

.. testcode::

    from daskms import xds_from_ms
    from daskms.example_data import example_ms

    datasets = xds_from_ms(example_ms())
    print(datasets)

This produces a list of two datasets:

.. testoutput::

    [<xarray.Dataset>
    Dimensions: (chan: 16, corr: 4, row: 4, uvw: 3)
    Coordinates:
        ROWID (row) int32 dask.array
    Dimensions without coordinates: chan, corr, row, uvw
    Data variables:
        UVW (row, uvw) float64 dask.array
        TIME (row) float64 dask.array
        ANTENNA1 (row) int32 dask.array
        ANTENNA2 (row) int32 dask.array
        DATA (row, chan, corr) complex64 dask.array
    Attributes:
        FIELD_ID: 0
        DATA_DESC_ID: 0,
    <xarray.Dataset>
    Dimensions: (chan: 32, corr: 2, row: 6, uvw: 3)
    Coordinates:
        ROWID (row) int32 dask.array
    Dimensions without coordinates: chan, corr, row, uvw
    Data variables:
        UVW (row, uvw) float64 dask.array
        TIME (row) float64 dask.array
        ANTENNA1 (row) int32 dask.array
        ANTENNA2 (row) int32 dask.array
        DATA (row, chan, corr) complex64 dask.array
    Attributes:
        FIELD_ID: 0
        DATA_DESC_ID: 1
    ]

Keen-eyed readers will note that the first dataset has 4 rows,
16 channels, 4 correlations and a DATA_DESC_ID of 0, while the second has
6 rows, 32 channels, 2 correlations and a DATA_DESC_ID of 1.
Here, rows with the same DATA_DESC_ID have been grouped together
into a single dataset, so that a column which is variably shaped across
the whole table has a fixed shape within each dataset.
The datasets are also grouped on FIELD_ID, but only one FIELD is present
in this Measurement Set.
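
The grouping idea can be illustrated with NumPy alone. The sketch below is a conceptual illustration, not dask-ms's actual implementation: it takes hand-written FIELD_ID and DATA_DESC_ID values matching the example and collects the row numbers belonging to each unique pair.

```python
import numpy as np

# Hypothetical column values for the ten rows of the example.
field_id = np.zeros(10, dtype=np.int32)
ddid = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=np.int32)

# Unique (FIELD_ID, DATA_DESC_ID) pairs, plus a group label per row.
keys = np.stack([field_id, ddid], axis=1)
groups, inverse = np.unique(keys, axis=0, return_inverse=True)
# Flatten, since the shape of `inverse` differs between NumPy versions.
inverse = inverse.reshape(-1)

# Row numbers belonging to each group.
rows = {
    tuple(int(v) for v in group): np.flatnonzero(inverse == i).tolist()
    for i, group in enumerate(groups)
}
print(rows)  # {(0, 0): [0, 1, 2, 3], (0, 1): [4, 5, 6, 7, 8, 9]}
```

Unlike a contiguous row range, this grouping also works when rows with the same identifiers are interleaved.
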