Reading Data with the OpenCosmo Toolkit

Once your query is complete, you can download the data to your local computer or another machine as an HDF5 file. Chances are you now want to read that data and do additional work with it, such as making a plot or performing additional queries.

The OpenCosmo Toolkit can read and write files in the OpenCosmo data format, as well as perform additional queries and complex computations. For a full discussion of the toolkit's capabilities, see the documentation. We'll give a basic overview here.

Opening data

Opening data with the toolkit is straightforward:

import opencosmo as oc

ds = oc.open("my_data.hdf5")

For simple catalog queries, the ds variable is an OpenCosmo Dataset. Note that ds does not actually contain the data that is in the file; it just allows you to interact with it. To request all of the data, call get_data:

data = ds.get_data()

This returns the data as an astropy table.

For more complex queries (such as particle queries), the file may contain more than one dataset. In these cases ds will not be a simple dataset, but may be a collection such as a StructureCollection. See the full documentation for more info.

Accessing Simulation Information

An OpenCosmo dataset or collection contains metadata such as the cosmology the simulation was run with, as well as information such as the box size. You can access the cosmology of the simulation directly using the .cosmology attribute:

cosmology = ds.cosmology

You can access the simulation parameters with the .simulation attribute:

sim_params = ds.simulation

Performing Additional Queries

It may be that you requested more data from the portal than you actually need, or you may want to slice it up in a way that makes sense for your particular analysis. The OpenCosmo toolkit allows you to perform additional queries easily:


min_mass_filter = oc.col("fof_halo_mass") > 1e14

ds = ds.filter(min_mass_filter)
data = ds.get_data()

Note that when we perform the filter, the returned value is a new OpenCosmo dataset. Operations on OpenCosmo datasets always return a new dataset, rather than modifying the dataset you already have. Because of this, you can create multiple new datasets from a single parent dataset:


min_mass_filter = oc.col("fof_halo_mass") > 1e14
max_mass_filter = oc.col("fof_halo_mass") < 1e14

ds_high_mass = ds.filter(min_mass_filter)
ds_low_mass = ds.filter(max_mass_filter)
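
To make the "operations return a new dataset" behavior concrete, here is a minimal pure-Python sketch of the same pattern. ToyDataset is a hypothetical stand-in written for this guide. It is not the toolkit's implementation, and it takes a plain function as its filter rather than an oc.col expression.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT the OpenCosmo implementation."""

    rows: tuple  # each row is a dict mapping column name -> value

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyDataset":
        # Build and return a new object; self.rows is never modified.
        return ToyDataset(tuple(row for row in self.rows if predicate(row)))


parent = ToyDataset(tuple({"fof_halo_mass": m} for m in (5e13, 2e14, 8e13, 3e14)))

high_mass = parent.filter(lambda row: row["fof_halo_mass"] > 1e14)
low_mass = parent.filter(lambda row: row["fof_halo_mass"] < 1e14)

# The parent still holds all four rows after both filters.
print(len(parent.rows), len(high_mass.rows), len(low_mass.rows))
```

Because filter never touches the parent, the order in which the two child datasets are created does not matter.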

Selecting Subsets of Columns

It's quite likely that you don't need all the columns in the dataset. Reading all of them would be slow and use up your computer's memory for no reason. We can select subsets of columns from the dataset with the select method:

ds = ds.select(("fof_halo_mass", "sod_halo_mass", "sod_halo_cdelta"))

Getting Rows and Sorting

You may want to test your analysis on a small number of objects before applying it to the full dataset. We can easily get just a few rows from the dataset with the take method. For example, we can get 100 random halos from our catalog with:

ds = ds.take(100, at="random")

It's also possible you may want to select a particular 100 halos, such as the 100 most massive halos in the dataset. We can do this by combining take with sort_by:

ds = ds.sort_by("fof_halo_mass", invert=True).take(100, at="start")

Note that OpenCosmo, like numpy and astropy, sorts in ascending order (least to greatest) by default. In this example, we wanted the 100 most massive halos, so we sorted in descending order (greatest to least) by setting invert=True.
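
As a plain-Python picture of which rows a descending sort followed by a take from the start keeps, the sketch below uses a made-up list of masses and the built-in sorted function. It only illustrates the ordering logic and does not use the toolkit.

```python
# Made-up halo masses, purely for illustration.
halo_masses = [5e13, 2e14, 8e13, 3e14, 1e13]
n = 3

ascending = sorted(halo_masses)                 # default order: least to greatest
descending = sorted(halo_masses, reverse=True)  # greatest to least

# Taking the first n rows of the descending sort gives the n most massive halos.
most_massive = descending[:n]
print(most_massive)
```

Taking from the start of the ascending sort instead would return the n least massive halos, which is why the sort direction matters here.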

Combining Columns

You can combine columns in the dataset into new columns that are relevant to your work. For example:

fof_halo_px = oc.col("fof_halo_mass")*oc.col("fof_halo_com_vx")
ds = ds.with_new_columns(fof_halo_px = fof_halo_px)

The new dataset will contain a column named fof_halo_px that you can then use as you would any other column. Because the toolkit is lazy, with_new_columns does not actually create a new column in memory. Instead, it simply computes the values that belong in the column when the data is requested.

This has a couple of advantages. If you now perform a filter that removes a large number of rows and then request data, the fof_halo_px column values will only be computed for the rows that pass your filter. This requires less memory and computation (potentially substantially less for very large datasets) than would be required if we computed the values right away.
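
The saving from lazy evaluation can be sketched in a few lines of plain Python. LazyToyDataset below is a hypothetical toy written for this guide, not the toolkit's implementation: it stores the recipe for a derived column and only runs it when get_data is called, after the filters have been applied. The calls counter is there only to show how many times the derived column is actually evaluated.

```python
class LazyToyDataset:
    """Toy illustration of lazy derived columns; NOT the toolkit's implementation."""

    def __init__(self, rows, derived=None, predicates=None):
        self._rows = rows                      # list of dicts standing in for the file
        self._derived = dict(derived or {})    # column name -> function of a row
        self._predicates = list(predicates or [])

    def with_new_columns(self, **columns):
        # Record the recipe only; no values are computed here.
        return LazyToyDataset(self._rows, {**self._derived, **columns}, self._predicates)

    def filter(self, predicate):
        # For simplicity, predicates in this toy may only use stored columns.
        return LazyToyDataset(self._rows, self._derived, self._predicates + [predicate])

    def get_data(self):
        # Filter first, then evaluate derived columns for the surviving rows.
        kept = [row for row in self._rows if all(p(row) for p in self._predicates)]
        return [
            {**row, **{name: fn(row) for name, fn in self._derived.items()}}
            for row in kept
        ]


calls = {"count": 0}  # tracks how often the derived column is evaluated


def momentum_x(row):
    calls["count"] += 1
    return row["fof_halo_mass"] * row["fof_halo_com_vx"]


rows = [{"fof_halo_mass": m, "fof_halo_com_vx": 100.0} for m in (5e13, 2e14, 8e13, 3e14)]
ds = LazyToyDataset(rows).with_new_columns(fof_halo_px=momentum_x)
ds = ds.filter(lambda row: row["fof_halo_mass"] > 1e14)
data = ds.get_data()
print(len(data), calls["count"])
```

With four rows in the toy file and two passing the filter, momentum_x runs twice rather than four times. An eager implementation would have computed all four values as soon as with_new_columns was called.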

Many columns you may want to compute will be too complicated to express with simple arithmetic on existing columns. For such cases, you can use the evaluate method. For more information, see the full documentation.

Structure Collections

If you performed a halo particle query (or a galaxy query with include halos = true), OpenCosmo will open your file as a StructureCollection rather than a Dataset. A StructureCollection contains multiple datasets that are related in some way and grouped into structures (halos or galaxies). For example, a collection might contain a set of halos (and their properties) as well as each halo's particles.

The StructureCollection will automatically determine which particles belong to which halos, leaving you to focus on your analysis. For example, you can loop through all the halos with their associated particles:

collection = oc.open("my_collection.hdf5")
for halo in collection.halos():
    halo_properties = halo["halo_properties"]
    dm_particles = halo["dm_particles"]
    particle_dx = halo_properties["fof_halo_center_x"] - dm_particles.select("x").get_data()

In each iteration of the loop, halo_properties will be a dictionary of the halo's properties, and dm_particles will be an OpenCosmo dataset containing the particle data for that halo.
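
To appreciate the bookkeeping a StructureCollection saves you, here is a rough pure-Python sketch of matching particles to halos by hand. The data is made up, and the fof_halo_tag key used to link the two tables is an assumption for this illustration. The sketch does not describe how the toolkit stores or matches structures internally.

```python
from collections import defaultdict

# Made-up halo and particle tables; "fof_halo_tag" is a hypothetical linking key.
halos = [
    {"fof_halo_tag": 1, "fof_halo_center_x": 10.0},
    {"fof_halo_tag": 2, "fof_halo_center_x": 50.0},
]
particles = [
    {"fof_halo_tag": 1, "x": 9.5},
    {"fof_halo_tag": 2, "x": 50.25},
    {"fof_halo_tag": 1, "x": 10.5},
]

# Group particles by the halo they belong to.
particles_by_halo = defaultdict(list)
for particle in particles:
    particles_by_halo[particle["fof_halo_tag"]].append(particle)

# For each halo, compute the offset of its particles from the halo center.
offsets = {}
for halo in halos:
    members = particles_by_halo[halo["fof_halo_tag"]]
    offsets[halo["fof_halo_tag"]] = [
        halo["fof_halo_center_x"] - p["x"] for p in members
    ]

print(offsets)
```

The inner computation mirrors the particle_dx line in the loop above. The grouping step is the part the collection handles for you.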

As with datasets, you can use filter, take, select, and with_new_columns on StructureCollections, though there will be some subtle differences since a StructureCollection contains multiple datasets. See the StructureCollection documentation for more info.

Next Steps

The OpenCosmo toolkit contains a wealth of additional functionality that was not covered in this brief guide, such as automatic parallelization with MPI. You can read more in the full documentation.

We are always working to improve the toolkit and its capabilities. If you run into a bug, or have an idea for a feature that would make it easier for you to perform your science, feel free to raise an issue on the GitHub repo.