Skip to content

The Database Interface Classes

For each API endpoint exposed in the Django app, there is a corresponding class that provide methods to execute CRUD operations asynchronously.

There are two types of API endpoints – those that contain only records data, and those that store both records and pointers to files.

Connecting to the Database

The database currently runs on HTCF service partition. This is a single node with 8 CPU and 30 GB that is meant for long running low resource jobs. The components that need to run are a postgres database, a redis instance and the django app. As long as these components are running on the service partition, you can connect via an ssh tunnel with:

ssh username@login.htcf.wustl.edu -N -L 8001:n240:8000

where the 8001:n240:8000 takes the form of local_port:cluster_node:app_port. The django app will always be served on port 8000, and n240 is the only service partition node. You may choose a different local port.

If you do this and cannot connect, let me know and I’ll check the status of the jobs on the cluster.

Database username and password

Once you have a tunnel, you can access the database frontend at 127.0.0.1:8001 (or a different local port, if you changed that number). If you haven’t already signed up, you’ll need to click the ‘sign up’ button and follow the instructions. The e-mail server is not hooked up at the moment, so when it says “see the e-mail”, send a slack message and let me know. I’ll give you a link to complete the sign up process. After that, you can just use the “sign in” button.

For computational tasks, including generating rank response data, celery workers must be launched on the HTCF general partition. There is currently a script that is meant to monitor the redis queue and launch/kill these workers automatically, but this functionality is new and largely untested. You can monitor the workers/tasks if you create another tunnel with:

ssh username@login.htcf.wustl.edu -N -L 8002:n240:5555

You’d access this dashboard at 127.0.0.1:5555

The login is currently:

username: "flower"
password: "daisy"

(yes, really – the security comes from the fact that you need to login with HTCF)

Configuring the Database Interface Classes

The database classes expect the following environmental variables to be set.

BASE_URL='http://127.0.0.1:8001'
TOKEN='<your token="">'
BINDING_URL='http://127.0.0.1:8001/api/binding'
BINDINGMANUALQC_URL='http://127.0.0.1:8001/api/bindingmanualqc'
CALLINGCARDSBACKGROUND_URL='http://127.0.0.1:8001/api/callingcardsbackground'
DATASOURCE_URL='http://127.0.0.1:8001/api/datasource'
EXPRESSION_URL='http://127.0.0.1:8001/api/expression'
EXPRESSIONMANUALQC_URL='http://127.0.0.1:8001/api/expressionmanualqc'
FILEFORMAT_URL='http://127.0.0.1:8001/api/fileformat'
GENOMICFEATURE_URL='http://127.0.0.1:8001/api/genomicfeature'
PROMOTERSET_URL='http://127.0.0.1:8001/api/promoterset'
PROMOTERSETSIG_URL='http://127.0.0.1:8001/api/promotersetsig'
REGULATOR_URL='http://127.0.0.1:8001/api/regulator'

This can be achieved in the package during development with a .env file at the top most level of the package. The .env file is loaded in the package __init__.py.

If you are importing yeastdnnexplorer into a different environment, then you’ll need to add the package dotenv and execute load_dotenv(dotenv_path=env_path). If the .env file is in the same PWD in which you execute that command, there is no need to specify a path.

Token Authentication

Once you have a username and password to the database, you can retrieve your token. Make sure that you put this token, at least, in a .env file, and make sure that .env file is in your .gitignore.

Alternatively, you could retrieve and store in memory the token at the beginning of each session – this is more secure if you are not using a .env file.

The .env file is already in the yeastddnexplorer .gitignore

curl -X 'POST' \
  'http://127.0.0.1:8001/auth-token/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "username": "username",
  "password": "password"
}'

Or with python:

import requests

url = "http://127.0.0.1:8001/auth-token/"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}
data = {
    "username": "username",
    "password": "password",
}

response = requests.post(url, json=data, headers=headers)
print(response.text)

Using the Interface Classes

from yeastdnnexplorer.interface import *
import matplotlib.pyplot as plt

Records Only Endpoints

The records only endpoints are:

  • BindingManualQC

  • DataSource

  • ExpressionManualQC

  • FileFormat

  • GenomicFeature

  • PromoterSetSig

  • Regulator

When the read() method is called on the corresponding API classes, a dataframe will be returned in the response.

All of the read() methods, for both types of API endpoints, return the result of a callable. By default, the callable returns a dictionary with two keys: metadata and data. For response only tables, the metadata value will be the records from the database as a pandas dataframe and the data will be `None.

Example – RegulatorAPI

regulator = RegulatorAPI()

result = await regulator.read()
result.get("metadata")
id uploader modifier regulator_locus_tag regulator_symbol upload_date modified_date under_development notes genomicfeature
0 1 chasem chasem YAL051W OAF1 2024-07-01 2024-07-01T12:47:18.619129-05:00 False none 24
1 2 chasem chasem YBL103C RTG3 2024-07-01 2024-07-01T12:47:19.667722-05:00 False none 140
2 3 chasem chasem YBL066C SEF1 2024-07-01 2024-07-01T12:47:20.523161-05:00 False none 186
3 4 chasem chasem YBL054W TOD6 2024-07-01 2024-07-01T12:47:21.309606-05:00 False none 199
4 5 chasem chasem YBL052C SAS3 2024-07-01 2024-07-01T12:47:22.161007-05:00 False none 201
... ... ... ... ... ... ... ... ... ... ...
1809 1810 chasem chasem YOR262W GPN2 2024-07-01 2024-07-01T14:14:36.164403-05:00 False none 6387
1810 1811 chasem chasem YPR190C RPC82 2024-07-01 2024-07-01T14:14:38.921261-05:00 False none 7070
1811 1812 chasem chasem YPL228W CET1 2024-07-01 2024-07-01T14:15:51.518999-05:00 False none 6603
1812 1813 chasem chasem YKL049C CSE4 2024-07-01 2024-07-01T14:15:56.555122-05:00 False none 4083
1813 1814 chasem chasem YMR168C CEP3 2024-07-01 2024-07-01T14:22:14.060524-05:00 False none 5258

1814 rows × 10 columns

Record and File Endpoints

The record and file endpoints are the following:

  • CallingCardsBackground

  • Expression

  • PromoterSet

  • PromoterSetSig

  • RankResponse *

The default read() method is the same as the Records only Endpoint API classes. However, there is an additional argument, retrieve_files which if set to True will retrieve the file for which each record provides metadata. The return value of read() is again a callable, and by default the data key will store a dictionary where the keys correspond to the id column in the metadata.

# First, retrieve only the records -- you'll want to filter these results down before
# retrieving the files most likely
pss_api = PromoterSetSigAPI()
result = await pss_api.read()
result.get("metadata")
id uploader modifier single_binding composite_binding fileformat background upload_date modified_date file promoter source regulator_symbol regulator_locus_tag condition rank_recall data_usable background_name
0 1 chasem chasem 1.0 NaN 1 NaN 2024-07-01 2024-07-01T13:51:48.611781-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... NaN harbison_chip OAF1 YAL051W YPD unreviewed unreviewed NaN
1 2 chasem chasem 2.0 NaN 1 NaN 2024-07-01 2024-07-01T13:51:49.643452-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... NaN harbison_chip PDR3 YBL005W YPD unreviewed unreviewed NaN
2 3 chasem chasem 3.0 NaN 1 NaN 2024-07-01 2024-07-01T13:51:50.744384-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... NaN harbison_chip HIR1 YBL008W YPD unreviewed unreviewed NaN
3 4 chasem chasem 4.0 NaN 1 NaN 2024-07-01 2024-07-01T13:51:51.507918-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... NaN harbison_chip HAP3 YBL021C YPD unreviewed unreviewed NaN
4 5 chasem chasem 5.0 NaN 1 NaN 2024-07-01 2024-07-01T13:51:52.277595-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... NaN harbison_chip TOD6 YBL054W YPD unreviewed unreviewed NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2237 7011 admin admin NaN 145.0 5 1.0 2024-07-30 2024-07-30T16:39:36.457965-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4.0 mitra_cc CBF1 YJR060W NaN unreviewed pass ad1
2238 7012 admin admin NaN 146.0 5 1.0 2024-07-30 2024-07-30T16:39:36.848168-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4.0 mitra_cc GCN4 YEL009C NaN unreviewed pass ad1
2239 7013 admin admin NaN 147.0 5 1.0 2024-07-30 2024-07-30T16:39:37.234144-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4.0 mitra_cc OAF1 YAL051W NaN unreviewed pass ad1
2240 7014 admin admin NaN 148.0 5 1.0 2024-07-30 2024-07-30T16:39:38.547155-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4.0 mitra_cc YOX1 YML027W NaN unreviewed pass ad1
2241 7015 admin admin NaN 149.0 5 1.0 2024-07-30 2024-07-30T16:39:39.713590-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4.0 mitra_cc LEU3 YLR451W NaN unreviewed pass ad1

2242 rows × 18 columns

Filtering

All API classes have a params attribute which stores the filtering parameters which will be applied to the HTTP requests.

pss_api.push_params({"regulator_symbol": "GZF3",
                     "workflow": "nf_core_callingcards_1_0_0",
                     "data_usable": "pass"})
pss_api.params
ParamsDict({'regulator_symbol': 'GZF3', 'workflow': 'nf_core_callingcards_1_0_0', 'data_usable': 'pass'})

Retrieving files from a Records and Files Object

To retrieve files from a Records and Files endpoint object, do the following:

# note that retrieve_files is set to True
result = await pss_api.read(retrieve_files = True)

# the metadata slot is the same as before
result.get("metadata")
id uploader modifier single_binding composite_binding fileformat background upload_date modified_date file promoter source regulator_symbol regulator_locus_tag condition rank_recall data_usable background_name
0 6577 admin admin 1837 NaN 5 1 2024-07-01 2024-07-01T16:43:50.145871-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C unknown pass pass ad1
1 6580 admin admin 1841 NaN 5 1 2024-07-01 2024-07-01T16:43:50.968078-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C unknown pass pass ad1
2 6642 admin admin 1902 NaN 5 1 2024-07-01 2024-07-01T16:43:54.969507-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C unknown pass pass ad1
3 6651 admin admin 1911 NaN 5 1 2024-07-01 2024-07-01T16:43:55.326651-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C unknown pass pass ad1
4 6717 admin admin 1960 NaN 5 1 2024-07-01 2024-07-01T16:44:00.500038-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C unknown pass pass ad1
# but now the data slot is a dictionary where the `id` are keys and the values
# are the files parsed into pandas dataframes
result.get("data").get("6568")

Filtering on multiple items

Some filters, and eventually all, will accept multiple arguments as a comma separated string without spaces. For example:

pss_api.push_params({"regulator_symbol": "GZF3,RTG3"})

print(pss_api.params)
regulator_symbol: GZF3,RTG3, workflow: nf_core_callingcards_1_0_0, data_usable: pass

Parameters can be removed one by one

print(pss_api.params)
regulator_symbol: GZF3,RTG3, workflow: nf_core_callingcards_1_0_0, data_usable: pass

pss_api.pop_params('data_usable')

print(pss_api.params)
regulator_symbol: GZF3,RTG3, workflow: nf_core_callingcards_1_0_0

or cleared entirely

pss_api.pop_params(None)

print(pss_api.params)


Another example with Expression

This is used to get an expression_id to use in the RankResponseAPI() below

expression = ExpressionAPI()

expression.push_params({"regulator_symbol": "GZF3",
                        "lab": "mcisaac",
                        "time": "15"})

expression_res = await expression.read()

expression_res.get("metadata")
id uploader modifier regulator_id regulator_locus_tag regulator_symbol source_name assay upload_date modified_date ... replicate control mechanism restriction time file notes regulator source promotersetsig_processing
0 2516 chasem chasem 3578 YJL110C GZF3 mcisaac_oe overexpression 2024-07-01 2024-07-01T13:36:13.814201-05:00 ... 1 undefined gev N 15.0 https://yeastregulatorydb-htcf-data.s3.amazona... strain_id:SMY156n ; date:20150101 135 7 False
1 2510 chasem chasem 3578 YJL110C GZF3 mcisaac_oe overexpression 2024-07-01 2024-07-01T13:36:08.810276-05:00 ... 1 undefined gev P 15.0 https://yeastregulatorydb-htcf-data.s3.amazona... strain_id:SMY156 ; date:20150101 135 7 False

2 rows × 22 columns

Rank Response

The rank response endpoint is slightly different than the others. It is implemented asynchronously on the database side, and will run many tasks simultaneously. As such, it uses submit() and retrieve() methods.

Additionally, it is a POST request and all parameters are passed in a json list. If you include, for a given dictionary, multiple items to promoetersetsig_ids and/or expression_ids, those datasets will be aggregated prior to calculating the rank response. The current parameters that may be passed for each are:

  • promotersetsig_ids: a list of promotersetsig_ids. If more than 1, then the data will be aggregated prior to calcuating rank response
  • expression_ids: a list of expression_ids. If more than 1, then the data will be aggregated prior to calcuating rank response
  • expression_effect_colname: name of the column to use for the rank response effect
  • expression_effect_threshold: The threshold to use on abs(effect) to label responsive/ unresponsive genes
  • expression_pvalue_threshold: the threshold to use below which to label responsive/ unresponsive genes
  • rank_bin_size: the size of the bins by which rank response is summarized.
  • rank_by_binding_effect: if this is “true”, then rank by the binding effect first rather than pvalue. This is used for harbison_chip and mcisaac_oe
  • summarize_by_rank_bin: if this is set to false, the unsummarized rank response (merged binding/response) is returned
rr_api = RankResponseAPI()

data = [
    {
        "promotersetsig_ids": ["5555"],
        "expression_ids": ["2510"],
        "rank_by_binding_effect": "true",
    }
]

group_id = await rr_api.submit(post_dict=data)

result = await rr_api.retrieve(group_id)
result.get("metadata")
regulator_symbol promotersetsig_ids expression_ids n_responsive total_expression_genes id
0 GZF3 5555 2510 810 6175.0 39e81d73-a8c5-46ca-8d66-be2c02ccf8b6
result.get("data").get(result.get("metadata").id[0])
rank_bin n_responsive_in_rank random n_successes response_ratio pvalue ci_lower ci_upper
0 5 2 0.131174 2 0.400000 0.131211 0.052745 0.853367
1 10 3 0.131174 5 0.500000 0.005510 0.187086 0.812914
2 15 4 0.131174 9 0.600000 0.000027 0.322870 0.836636
3 20 0 0.131174 9 0.450000 0.000490 0.230578 0.684722
4 25 2 0.131174 11 0.440000 0.000149 0.244024 0.650718
5 30 1 0.131174 12 0.400000 0.000224 0.226558 0.593965
6 35 1 0.131174 13 0.371429 0.000296 0.214732 0.550769
7 40 0 0.131174 13 0.325000 0.001278 0.185729 0.491295
8 45 1 0.131174 14 0.311111 0.001363 0.181659 0.466491
9 50 0 0.131174 14 0.280000 0.004946 0.162311 0.424905
10 55 1 0.131174 15 0.272727 0.004378 0.161380 0.409619
11 60 0 0.131174 15 0.250000 0.011526 0.147186 0.378596
df = result.get("data").get(result.get("metadata").id[0]).iloc[1:50,:]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(df['rank_bin'], df['response_ratio'], marker='o')
plt.title('GZF3 - promotersetsig_id 6577. McIsaac 15 2510')
plt.xlabel('Rank Bin')
plt.ylabel('Response Ratio')
plt.title('Response Ratio vs Rank Bin')
plt.grid(True)
plt.show()
No description has been provided for this image

CallingCards Aggregated Data

For regulators which have data generated in the Brentlab and processed through the nf-core/callingcards:1.0.0 workflow, if that data has been manually (or eventually automatically) QC reviewed, and if there are at least 2 samples which are labeled as data_usable, then there will exist a BindingConcatenated record to which both a BindingManualQC record and a PromoterSetSig record are foreign keyed.

To view the set of samples for which there is aggregated data, you may do the following:

pss_api.pop_params(None)

pss_api.push_params({"source_name": "brent_nf_cc", "aggregated": "true"})

callingcards_aggregated_meta_res = await pss_api.read()

callingcards_aggregated_meta_df = callingcards_aggregated_meta_res.get("metadata")
callingcards_aggregated_meta_df
id uploader modifier single_binding composite_binding fileformat background upload_date modified_date file promoter source regulator_symbol regulator_locus_tag rank_recall data_usable background_name
0 6867 admin admin NaN 6 5 1 2024-07-06 2024-07-06T11:08:33.125644-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc WTM1 YOR230W pass pass ad1
1 6868 admin admin NaN 3 5 1 2024-07-06 2024-07-06T11:08:33.125640-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc MIG2 YGL209W pass pass ad1
2 6869 admin admin NaN 1 5 1 2024-07-06 2024-07-06T11:08:33.119818-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc CAT8 YMR280C pass pass ad1
3 6870 admin admin NaN 5 5 1 2024-07-06 2024-07-06T11:08:33.125605-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc PDR1 YGL013C pass pass ad1
4 6871 admin admin NaN 4 5 1 2024-07-06 2024-07-06T11:08:33.129564-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc PHO4 YFR034C pass pass ad1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
66 6933 admin admin NaN 67 5 1 2024-07-06 2024-07-06T11:08:45.567814-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc ABF2 YMR072W fail pass ad1
67 6934 admin admin NaN 68 5 1 2024-07-06 2024-07-06T11:08:45.921894-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc USV1 YPL230W pass pass ad1
68 6935 admin admin NaN 69 5 1 2024-07-06 2024-07-06T11:08:46.166036-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc MGA1 YGR249W pass pass ad1
69 6936 admin admin NaN 70 5 1 2024-07-06 2024-07-06T11:08:46.246482-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc CIN5 YOR028C pass pass ad1
70 6937 admin admin NaN 71 5 1 2024-07-06 2024-07-06T11:08:47.987665-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc ROX1 YPR065W pass pass ad1

71 rows × 17 columns

pss_api.push_params({"regulator_symbol": "GZF3"})

callingcards_res = await pss_api.read()

gzf3_callingcards_res = callingcards_res.get("metadata")

gzf3_callingcards_res

#gzf3_callingcards_res[~gzf3_callingcards_res.composite_binding_id.isna()]
id uploader modifier single_binding composite_binding fileformat background upload_date modified_date file promoter source regulator_symbol regulator_locus_tag rank_recall data_usable background_name
0 6873 admin admin NaN 7 5 1 2024-07-06 2024-07-06T11:08:33.117009-05:00 https://yeastregulatorydb-htcf-data.s3.amazona... 4 brent_nf_cc GZF3 YJL110C pass pass ad1
data = [
    {
        "promotersetsig_ids": ["6873"],
        "expression_ids": ["2510"],
    }
]

group_id = await rr_api.submit(post_dict=data)

agg_result = await rr_api.retrieve(group_id)

agg_df = agg_result.get("data").get(agg_result.get("metadata").id[0]).iloc[1:50,:]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(agg_df['rank_bin'], agg_df['response_ratio'], marker='o')
plt.title('Aggregated GZF3. McIsaac 15 2510')
plt.xlabel('Rank Bin')
plt.ylabel('Response Ratio')
plt.title('Response Ratio vs Rank Bin')
plt.grid(True)
plt.show()
No description has been provided for this image

Caveats

  1. I have written the scripts to automatically check the redis queue for work and to both launch celery worker nodes, and kill them when they are finished. But, though they work if I run them manually, they have not worked when scheduled through a cronjob. I’ll work with Brian and Eric next week to figure out why.

  2. I haven’t tested each of the endpoint APIs individually. Help is welcome.