update straw #95

Merged (6 commits) on Feb 10, 2022
52 changes: 14 additions & 38 deletions README.md
@@ -2,60 +2,36 @@
Straw is a library that allows rapid streaming of contact data from .hic files.
This repository contains source code for the C++, R, Python, and MATLAB versions of Straw.

For the Java version of Straw, see: https://github.com/aidenlab/java-straw/
- For the Java version, see: https://github.com/aidenlab/java-straw/
- For the Javascript version, see: https://github.com/igvteam/hic-straw/

For the Javascript version of Straw, see: https://github.com/igvteam/hic-straw/
There are two Python versions - a pure Python flavor and one which wraps the C++ code with pybind11.
The former version has been deprecated in favor of using the pybind11 version, which is much faster.

There are two Python versions - a pure Python one and a version where the C++ code is bound with pybind11. The former version is deprecated in favor of using the pybind11 version, which is much faster.
- For archival purposes, the pure python version has been moved to: https://github.com/aidenlab/pystraw

A Jupyter notebook example can be found here: https://aidenlab.gitbook.io/juicebox/accessing-raw-data
A Jupyter notebook example of using straw can be found here: https://aidenlab.gitbook.io/juicebox/accessing-raw-data

## Quick Start Python
## Install straw for python

Use `pip install hic-straw`. Or if you want to build from the source code, you must have pybind11 installed. Clone the library and `cd` into the `straw/` directory. Then `pip install ./pybind11_python`.
Use `pip install hic-straw`.
If you want to build from the source code, you must have pybind11 installed.
Clone the library and `cd` into the `straw/` directory. Then `pip install ./pybind11_python`.

You can run your code via `import strawC` (or `hic-straw`) and `strawC.strawC`, for example:

```python
import strawC
result = strawC.strawC('observed', 'NONE', 'HIC001.hic', 'X', 'X', 'BP', 1000000)
for i in range(len(result)):
    print("{0}\t{1}\t{2}".format(result[i].binX, result[i].binY, result[i].counts))
```
To query observed/expected data:
```python
import strawC
result = strawC.strawC('oe', 'NONE', 'HIC001.hic', 'X', 'X', 'BP', 1000000)
for i in range(len(result)):
    print("{0}\t{1}\t{2}".format(result[i].binX, result[i].binY, result[i].counts))
```

### Usage
```
strawC.strawC(data_type, normalization, file, region_x, region_y, 'BP', resolution)
```

`data_type`: `'observed'` (previous default / "main" data) or `'oe'` (observed/expected)<br>
`normalization`: `NONE`, `VC`, `VC_SQRT`, `KR`, `SCALE`, etc.<br>
`file`: filepath (local or URL)<br>
`region_x/y`: provide the `chromosome` or utilize the syntax `chromosome:start_position:end_position` if using a smaller window within the chromosome<br>
`resolution`: typically `2500000`, `1000000`, `500000`, `100000`, `50000`, `25000`, `10000`, `5000`, etc.<br><br>
Note: the normalization, resolution, and chromosome/regions must already exist in the .hic to be read
(i.e. they are not calculated by straw, only read from the file if available)<br>


## Compile on Linux
## Compile straw for C++

```bash
g++ -std=c++0x -o straw main.cpp straw.cpp -lcurl -lz
```

Please see [the wiki](https://github.com/theaidenlab/straw/wiki) for more documentation.
You must have cURL installed.
Please see [the wiki](https://github.com/aidenlab/straw/wiki) for more documentation.

For questions, please use
[the Google Group](https://groups.google.com/forum/#!forum/3d-genomics).

Ongoing development work is carried out by <a href="http://mshamim.com">Muhammad S. Shamim</a>.
Past contributors include <a href="http://www.cherniavsky.net/neva/">Neva C. Durand</a> and many others.

If you use this tool in your work, please cite

111 changes: 111 additions & 0 deletions pybind11_python/README.md
@@ -0,0 +1,111 @@
## Quick Start Python

Straw is a library that allows rapid streaming of contact data from .hic files.
To learn more about Hi-C data and 3D genomics, visit https://aidenlab.gitbook.io/juicebox/

Once you've installed the library with `pip install hic-straw`, you can use it in your code with `import hicstraw`.

## New usage with a Hi-C file object

The new usage for straw allows you to create objects and retain intermediate variables.
This can speed up your code significantly when querying hundreds or thousands of regions
for a given chromosome/resolution/normalization.

First we import `numpy` and `hicstraw`.
```python
import numpy as np
import hicstraw
```

We then create a Hi-C file object.
From this object, we can query genomeID, chromosomes, and resolutions.
```python
hic = hicstraw.HiCFile("HIC001.hic")
print(hic.getChromosomes())
print(hic.getGenomeID())
print(hic.getResolutions())
```

We can also collect a matrix zoom data object, which is specific to one combination of:
- matrix type: `observed` (count) or `oe` (observed/expected ratio)
- chromosome-chromosome pair
- resolution
- normalization

This object retains information for fast future queries.
Here's an example that picks the counts from the intrachromosomal region for chr4
with KR normalization at 5 kb resolution.
```python
mzd = hic.getMatrixZoomData('4', '4', "observed", "KR", "BP", 5000)
```

We can get numpy matrices for specific genomic windows by calling:
```python
numpy_matrix = mzd.getRecordsAsMatrix(10000000, 12000000, 10000000, 12000000)
```
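To get a feel for how large the returned matrix will be, the window-to-bin arithmetic can be sketched in plain Python. This is a minimal sketch, not hicstraw's implementation: `bin_span` is a hypothetical helper, and it assumes that a position falls in bin `position // resolution` and that both ends of the window are inclusive.

```python
def bin_span(start, end, resolution):
    """Return (first_bin, last_bin, n_bins) for a genomic window.

    Assumption (not taken from hicstraw): a position p falls in bin
    p // resolution, and both ends of the window are inclusive.
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    first_bin = start // resolution
    last_bin = end // resolution
    return first_bin, last_bin, last_bin - first_bin + 1


# The window from the example above, at 5 kb resolution.
first, last, n_bins = bin_span(10_000_000, 12_000_000, 5000)
print(first, last, n_bins)
```

Under these assumptions the example window spans 401 bins per axis. On real data, check `numpy_matrix.shape` rather than relying on this estimate.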

### Usage
```
hic = hicstraw.HiCFile(filepath)
hic.getChromosomes()
hic.getGenomeID()
hic.getResolutions()

mzd = hic.getMatrixZoomData(chrom1, chrom2, data_type, normalization, "BP", resolution)

numpy_matrix = mzd.getRecordsAsMatrix(gr1, gr2, gc1, gc2)
records_list = mzd.getRecords(gr1, gr2, gc1, gc2)
```

`filepath`: path to file (local or URL)<br>
`chrom1`, `chrom2`: chromosomes of the pair to query (e.g. `'4'`, `'4'`)<br>
`data_type`: `'observed'` (previous default / "main" data) or `'oe'` (observed/expected)<br>
`normalization`: `NONE`, `VC`, `VC_SQRT`, `KR`, `SCALE`, etc.<br>
`resolution`: typically `2500000`, `1000000`, `500000`, `100000`, `50000`, `25000`, `10000`, `5000`, etc.<br>
`gr1`: start genomic position along rows<br>
`gr2`: end genomic position along rows<br>
`gc1`: start genomic position along columns<br>
`gc2`: end genomic position along columns<br><br>
Note: the normalization, resolution, and chromosome/regions must already exist in the .hic to be read
(i.e. they are not calculated by straw, only read from the file if available)<br>
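Because the matrix zoom data object retains information between queries, one natural pattern is to create it once and loop over many windows. The sketch below assumes only the `getRecordsAsMatrix(gr1, gr2, gc1, gc2)` call shown above; `fetch_windows` is a hypothetical helper name, and the commented usage lines assume a local `HIC001.hic` file.

```python
def fetch_windows(mzd, windows):
    """Query one matrix zoom data object for several square windows.

    `mzd` is expected to come from hic.getMatrixZoomData(...); `windows`
    is an iterable of (start, end) genomic positions. The same range is
    used for rows and columns, so each result is an on-diagonal block.
    """
    matrices = {}
    for start, end in windows:
        # Reusing one mzd lets it keep its retained information between queries.
        matrices[(start, end)] = mzd.getRecordsAsMatrix(start, end, start, end)
    return matrices


# Intended usage (requires hicstraw and a .hic file, so left commented out):
# hic = hicstraw.HiCFile("HIC001.hic")
# mzd = hic.getMatrixZoomData('4', '4', "observed", "KR", "BP", 5000)
# blocks = fetch_windows(mzd, [(10000000, 12000000), (12000000, 14000000)])
```

Keying the result by `(start, end)` keeps each matrix tied to the window it came from. Whether this is faster than separate calls on a given file is worth measuring rather than assuming.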


## Legacy usage to fetch list of contacts

For example, to fetch a list of all the raw contacts on chrX at 1 Mb resolution:

```python
import hicstraw
result = hicstraw.straw('observed', 'NONE', 'HIC001.hic', 'X', 'X', 'BP', 1000000)
for i in range(len(result)):
    print("{0}\t{1}\t{2}".format(result[i].binX, result[i].binY, result[i].counts))
```

To fetch a list of KR normalized contacts for the same region:
```python
import hicstraw
result = hicstraw.straw('observed', 'KR', 'HIC001.hic', 'X', 'X', 'BP', 1000000)
for i in range(len(result)):
    print("{0}\t{1}\t{2}".format(result[i].binX, result[i].binY, result[i].counts))
```

To query observed/expected KR normalized data:
```python
import hicstraw
result = hicstraw.straw('oe', 'KR', 'HIC001.hic', 'X', 'X', 'BP', 1000000)
for i in range(len(result)):
    print("{0}\t{1}\t{2}".format(result[i].binX, result[i].binY, result[i].counts))
```
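The records in the examples above expose `binX`, `binY`, and `counts`. If random access is more convenient than a list, they could be collected into a lookup table. This is a sketch under assumptions: `Record` is a stand-in for the real record objects so the example runs without a .hic file, `records_to_dict` is a hypothetical helper, and the mirroring step assumes an intrachromosomal query in which each bin pair is reported only once.

```python
from collections import namedtuple

# Stand-in for the records returned by hicstraw.straw; only the three
# attributes used in the examples above are modelled.
Record = namedtuple("Record", ["binX", "binY", "counts"])


def records_to_dict(records):
    """Map (binX, binY) -> counts, mirrored so either order can be looked up."""
    contacts = {}
    for r in records:
        contacts[(r.binX, r.binY)] = r.counts
        # Assumption: each pair appears once, so store the mirrored key too.
        contacts[(r.binY, r.binX)] = r.counts
    return contacts


demo = [Record(0, 0, 5.0), Record(0, 1000000, 2.0)]
lookup = records_to_dict(demo)
print(lookup[(1000000, 0)])
```

For interchromosomal queries the two axes refer to different chromosomes, so the mirrored key should be dropped.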

### Usage
```
hicstraw.straw(data_type, normalization, file, region_x, region_y, 'BP', resolution)
```

`data_type`: `'observed'` (previous default / "main" data) or `'oe'` (observed/expected)<br>
`normalization`: `NONE`, `VC`, `VC_SQRT`, `KR`, `SCALE`, etc.<br>
`file`: filepath (local or URL)<br>
`region_x/y`: provide the `chromosome` or utilize the syntax `chromosome:start_position:end_position` if using a smaller window within the chromosome<br>
`resolution`: typically `2500000`, `1000000`, `500000`, `100000`, `50000`, `25000`, `10000`, `5000`, etc.<br><br>
Note: the normalization, resolution, and chromosome/regions must already exist in the .hic to be read
(i.e. they are not calculated by straw, only read from the file if available)<br>
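The `chromosome:start_position:end_position` syntax for `region_x` and `region_y` can be assembled with a small helper. This is a sketch only: `make_region` is a hypothetical name, and the validation rules (non-negative start, start not greater than end) are assumptions rather than checks documented by straw.

```python
def make_region(chromosome, start=None, end=None):
    """Build a region argument for the legacy call.

    Returns just the chromosome name, or 'chromosome:start:end' when a
    window is given, following the syntax described above.
    """
    if start is None and end is None:
        return str(chromosome)
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return "{0}:{1}:{2}".format(chromosome, start, end)


print(make_region("X"))
print(make_region("X", 1000000, 5000000))
```

The resulting strings would be passed in place of `'X'` in the calls above, for example as both `region_x` and `region_y` to query a sub-window of chrX.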

18 changes: 13 additions & 5 deletions pybind11_python/setup.py
@@ -2,11 +2,16 @@
from setuptools.command.build_ext import build_ext
import sys
import setuptools
import os

__version__ = '1.1.0'


class get_pybind_include(object):
def read(fname):
    return open(os.path.join(os.path.dirname(__file__), fname)).read()


class GetPybindInclude(object):
    """Helper class to determine the pybind11 include path

    The purpose of this class is to postpone importing pybind11
@@ -27,8 +32,8 @@ def __str__(self):
        ['src/straw.cpp'],
        include_dirs=[
            # Path to pybind11 headers
            get_pybind_include(),
            get_pybind_include(user=True)
            GetPybindInclude(),
            GetPybindInclude(user=True)
        ],
        language='c++'
    ),
@@ -97,19 +102,22 @@ def build_extensions(self):
            ext.extra_link_args = link_opts
        build_ext.build_extensions(self)


setup(
    name='hicstraw',
    version=__version__,
    author='Neva C. Durand, Muhammad S Shamim',
    author_email='neva@broadinstitute.org',
    license='MIT',
    keywords=['Hi-C', '3D Genomics', 'Chromatin', 'ML'],
    url='https://github.com/aidenlab/straw',
    description='Straw bound with pybind11',
    long_description='',
    long_description=read('README.md'),
    long_description_content_type='text/markdown',
    ext_modules=ext_modules,
    install_requires=['pybind11>=2.4'],
    setup_requires=['pybind11>=2.4'],
    python_requires='>3.3',
    cmdclass={'build_ext': BuildExt},
    zip_safe=False,
)
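The `GetPybindInclude` docstring says the class exists to postpone importing pybind11. The general pattern can be sketched as follows. `LazyValue` and `find_include_dir` are hypothetical names for illustration, not code from this PR, and the returned path is a placeholder.

```python
class LazyValue(object):
    """Defer a lookup until the object is converted to a string."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable invoked only inside __str__

    def __str__(self):
        return self._resolver()


calls = []


def find_include_dir():
    # Hypothetical stand-in for `import pybind11; pybind11.get_include()`.
    calls.append("resolved")
    return "/hypothetical/include"


path = LazyValue(find_include_dir)  # nothing is resolved at this point
print(str(path))                    # the resolver runs here
```

Deferring the lookup this way means the object can be created before the dependency is importable, as long as nothing converts it to a string until the dependency has been installed.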