Query using tiledbsoma

The first guide showed how to query for AnnData objects.

This guide queries “Census”, i.e., a tiledbsoma array store that concatenates many AnnData objects.

Load your LaminDB instance for querying data:

!lamin load laminlabs/cellxgene
💡 connected lamindb: laminlabs/cellxgene
import lamindb as ln
import bionty as bt
import tiledbsoma

census_version = "2024-07-01"
💡 connected lamindb: laminlabs/cellxgene

Query data

Create lookups so that we can auto-complete valid values:

features = ln.Feature.lookup(return_field="name")
assays = bt.ExperimentalFactor.lookup(return_field="name")
cell_types = bt.CellType.lookup(return_field="name")
tissues = bt.Tissue.lookup(return_field="name")
ulabels = ln.ULabel.lookup()
suspension_types = ulabels.is_suspension_type.children.all().lookup(return_field="name")

Create a query expression (a value filter) for the tiledbsoma array store:

value_filter = (
    f'{features.tissue} == "{tissues.brain}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell}", "{cell_types.neuron}"] and'
    f' {features.suspension_type} == "{suspension_types.cell}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
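
The same kind of expression can also be assembled from plain strings, which may help when the values come from a config file rather than from lookups. The following is a minimal sketch: `build_value_filter` is a hypothetical helper, not part of lamindb or tiledbsoma, and it assumes that every condition is either an equality test or a membership test and that no value contains a double quote.

```python
from typing import Mapping, Sequence, Union

FilterValue = Union[str, Sequence[str]]


def _quote(value: str) -> str:
    # Wrap the value in double quotes, so single quotes (as in "10x 3' v3") are fine.
    if '"' in value:
        raise ValueError(f"value must not contain a double quote: {value!r}")
    return f'"{value}"'


def build_value_filter(conditions: Mapping[str, FilterValue]) -> str:
    """Join column conditions with `and` (hypothetical helper).

    A string value becomes `column == "value"`;
    a sequence of strings becomes `column in ["a", "b"]`.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, str):
            clauses.append(f"{column} == {_quote(value)}")
        else:
            items = ", ".join(_quote(item) for item in value)
            clauses.append(f"{column} in [{items}]")
    return " and ".join(clauses)


plain_filter = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    }
)
print(plain_filter)
```

Unlike the lookups, plain strings are not validated against the registries, so a misspelled value simply matches no cells.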

Query for the tiledbsoma array store that contains all concatenated expression data.

census = ln.Artifact.filter(description=f"Census {census_version}").one()

Query slices within the array store. (This will run a lot faster from within the AWS us-west-2 data center.)

human = "homo_sapiens"  # subset to human data

# open the array store for queries
with census.open() as store:
    # read SOMADataFrame as a slice
    cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
    # concatenate results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # convert to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.shape
(66418, 28)
cell_metadata.head()
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... tissue tissue_ontology_term_id tissue_type tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
0 48182177 c888b684-6c51-431f-972a-6c963044cef0 10x 3' v3 EFO:0009922 microglial cell CL:0000129 68-year-old human stage HsapDv:0000162 glioblastoma MONDO:0018177 ... brain UBERON:0000955 tissue brain UBERON:0000955 15204.0 3959 3.840364 209.374207 27229
1 48182178 c888b684-6c51-431f-972a-6c963044cef0 10x 3' v3 EFO:0009922 microglial cell CL:0000129 68-year-old human stage HsapDv:0000162 glioblastoma MONDO:0018177 ... brain UBERON:0000955 tissue brain UBERON:0000955 39230.0 5885 6.666100 875.502870 27229
2 48182185 c888b684-6c51-431f-972a-6c963044cef0 10x 3' v3 EFO:0009922 microglial cell CL:0000129 68-year-old human stage HsapDv:0000162 glioblastoma MONDO:0018177 ... brain UBERON:0000955 tissue brain UBERON:0000955 9576.0 2738 3.497443 121.333753 27229
3 48182187 c888b684-6c51-431f-972a-6c963044cef0 10x 3' v3 EFO:0009922 microglial cell CL:0000129 68-year-old human stage HsapDv:0000162 glioblastoma MONDO:0018177 ... brain UBERON:0000955 tissue brain UBERON:0000955 19374.0 4096 4.729980 464.331956 27229
4 48182188 c888b684-6c51-431f-972a-6c963044cef0 10x 3' v3 EFO:0009922 microglial cell CL:0000129 68-year-old human stage HsapDv:0000162 glioblastoma MONDO:0018177 ... brain UBERON:0000955 tissue brain UBERON:0000955 8466.0 2477 3.417844 162.555950 27229

5 rows × 28 columns
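
The read above returns all 28 columns. When only a few are needed, the `column_names` argument of the SOMA dataframe `read()` method can restrict the slice. The following is a minimal sketch: `read_obs_table` is a hypothetical wrapper, and it assumes `store` is the object yielded by `census.open()`.

```python
def read_obs_table(store, organism, value_filter, column_names=None):
    """Read a filtered slice of the obs dataframe as a single Arrow table.

    Hypothetical wrapper: `store` is assumed to be the object yielded by
    `census.open()`. `column_names=None` reads all columns.
    """
    obs = store["census_data"][organism].obs
    # read() returns an iterator over Arrow tables; concat() merges them into one
    return obs.read(value_filter=value_filter, column_names=column_names).concat()
```

Inside the `with census.open() as store:` block, it could be called as `read_obs_table(store, human, value_filter, ["cell_type", "disease"]).to_pandas()`.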

Create an AnnData

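with census.open() as store:
    # select the human experiment within the array store
    experiment = store["census_data"][human]
    # query the RNA measurement and materialize the matching cells as an AnnData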
    adata = experiment.axis_query(
        "RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        }
    )
adata.var = adata.var.set_index("feature_id")
adata
AnnData object with n_obs × n_vars = 66418 × 60530
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
adata.var.head()
soma_joinid feature_name feature_length nnz n_measured_obs
feature_id
ENSG00000000003 0 TSPAN6 4530 4530448 73855064
ENSG00000000005 1 TNMD 1476 236059 61201828
ENSG00000000419 2 DPM1 9276 17576462 74159149
ENSG00000000457 3 SCYL3 6883 9117322 73988868
ENSG00000000460 4 C1orf112 5970 6287794 73636201
adata.obs.head()
assay cell_type tissue disease suspension_type
0 10x 3' v3 microglial cell brain glioblastoma cell
1 10x 3' v3 microglial cell brain glioblastoma cell
2 10x 3' v3 microglial cell brain glioblastoma cell
3 10x 3' v3 microglial cell brain glioblastoma cell
4 10x 3' v3 microglial cell brain glioblastoma cell
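
Because `to_anndata` loads the whole slice into memory, it can be useful to check how many cells a query matches before materializing it. The following is a minimal sketch that uses the axis query as a context manager and reads its `n_obs` property first. `query_to_anndata` and the `max_obs` threshold are hypothetical and not part of tiledbsoma or lamindb.

```python
def query_to_anndata(experiment, obs_query, obs_columns, max_obs=100_000):
    """Materialize an axis query as AnnData only if it is small enough.

    Hypothetical helper: `experiment` is assumed to be a SOMA experiment such
    as store["census_data"]["homo_sapiens"], and `max_obs` is an arbitrary
    threshold chosen for illustration.
    """
    with experiment.axis_query("RNA", obs_query=obs_query) as query:
        n_obs = query.n_obs  # number of cells matched by the obs filter
        if n_obs > max_obs:
            raise ValueError(f"query matches {n_obs} cells, more than max_obs={max_obs}")
        return query.to_anndata(X_name="raw", column_names={"obs": list(obs_columns)})
```

With the objects defined above, it could be called inside the `with census.open() as store:` block as `query_to_anndata(experiment, tiledbsoma.AxisQuery(value_filter=value_filter), [features.assay, features.cell_type])`.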