Pipeline with Conda
Note
For new development setups, using Pixi directly is recommended. Pixi provides a reproducible, lock-file-based environment that is simpler to maintain than a manually managed Conda environment.
Warning
Running the pipeline data-processing workflow from a modular CASA 6 setup is not officially supported or validated for observatory operations. The information provided here is for development and demonstration purposes only.
Step-by-step
Install Miniforge or Micromamba. The example below uses the Miniforge3 installer, which enables only the conda-forge channel by default.
#!/bin/bash

# 1. Detect OS and architecture (Miniforge installers use "MacOSX" rather than "Darwin")
OS=$(uname -s | sed 's/Darwin/MacOSX/')
ARCH=$(uname -m)

# 2. Construct the download URL
URL="https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-${OS}-${ARCH}.sh"

# 3. Download the installer
echo "Downloading Miniforge for ${OS}-${ARCH}..."
curl -L "$URL" -o miniforge.sh

# 4. Run the installer (adjust the installation path as needed)
PREFIX=/opt/miniforge3
echo "Installing to ${PREFIX}..."
bash miniforge.sh -b -f -p "${PREFIX}"

# 5. Update and clean up. Call conda by its full path because a batch-mode (-b)
#    install does not add it to PATH; -y skips the confirmation prompt.
"${PREFIX}/bin/conda" update --all -y
"${PREFIX}/bin/conda" clean -a -y
rm miniforge.sh
echo "Installation complete."
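The OS and architecture detection in the script above can also be expressed in Python, which may be convenient when provisioning is driven from a Python tool. This is a minimal sketch: the helper name miniforge_installer_url is hypothetical, and it assumes the same Miniforge3-<OS>-<ARCH>.sh file naming that the shell script relies on.

```python
import platform

BASE_URL = "https://github.com/conda-forge/miniforge/releases/latest/download"


def miniforge_installer_url(system=None, machine=None):
    """Build the installer URL the same way the shell script does.

    `system` and `machine` default to the values for the running host
    (the Python counterparts of `uname -s` and `uname -m`).
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # Mirror the sed substitution: installers are named "MacOSX", not "Darwin".
    os_name = "MacOSX" if system == "Darwin" else system
    return f"{BASE_URL}/Miniforge3-{os_name}-{machine}.sh"


if __name__ == "__main__":
    print(miniforge_installer_url())
```

Passing explicit values, e.g. miniforge_installer_url("Darwin", "arm64"), makes it possible to check the URL for a platform other than the current host.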
Reproduce a Python environment containing the modular CASA6 components and the libraries that they and Pipeline depend on (e.g., openmpi).
Fetch source code:
git clone https://open-bitbucket.nrao.edu/scm/pipe/pipeline.git
cd pipeline
Generate environment.yml on demand from pixi (environment.yml is no longer committed to the repository; see Pixi and Running Pixi tasks for details):
pixi project export conda-environment \
    | sed 's|^- pip:$|- pip:\n  - --extra-index-url https://casa-pip.nrao.edu/repository/pypi-group/simple|' \
    > environment.yml
The sed step injects the NRAO CASA pip index URL, which pixi's export drops by default. The \n in the replacement text relies on GNU sed; BSD sed (the macOS default) may not interpret it as a newline. If pixi is not yet installed, see Pixi and Running Pixi tasks for the one-line installer.
Create or update the Conda environment using the generated file:
conda env create --name pipeline --file=environment.yml
This will create a Conda environment named pipeline.
Note
To update or remove a pre-existing environment:
conda env update --name pipeline --file=environment.yml          # Update existing
conda env update --name pipeline --file=environment.yml --prune  # Also remove unlisted packages
conda env remove --name pipeline                                 # Remove the environment entirely
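The sed step used earlier to generate environment.yml can be mirrored in Python where a GNU-compatible sed is not at hand. The sketch below is illustrative only: the function name inject_pip_index is hypothetical, and it assumes that the exported file places "- pip:" at column 0 with nested entries indented by two spaces.

```python
CASA_INDEX = "https://casa-pip.nrao.edu/repository/pypi-group/simple"


def inject_pip_index(env_text, index_url=CASA_INDEX):
    """Insert an --extra-index-url entry right after the top-level '- pip:' line."""
    out = []
    for line in env_text.splitlines():
        out.append(line)
        # Same anchored match as the sed pattern '^- pip:$'.
        if line == "- pip:":
            # Assumption: nested pip entries use a two-space indent.
            out.append(f"  - --extra-index-url {index_url}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical, shortened export used only to demonstrate the transformation.
    sample = "name: pipeline\ndependencies:\n- python\n- pip:\n  - casatools\n"
    print(inject_pip_index(sample), end="")
```

Input without a "- pip:" line passes through unchanged, so the function is safe to apply to any export.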
Activate the environment and verify the CASA6 software stack installation:
conda activate pipeline

# Create the default CASA data directory if it doesn't exist (can be customized later)
mkdir -p ~/.casa/data

# Verify that casatools is installed and working. If the CASA data are not present
# locally, casaconfig may fetch them from the internet.
# Also see: https://casadocs.readthedocs.io/en/stable/api/casaconfig.html
python -c "import casatools; print('casatools version:', casatools.version_string())"
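Before importing casatools, it can help to confirm that the CASA6 modules referenced on this page are at least discoverable in the active environment. The following is a small sketch rather than an official check; the helper name missing_modules is hypothetical, and it only locates modules without importing them.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be located on the import path.

    For top-level names, find_spec() locates a module without importing it,
    so no package initialization code runs during this check.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the commands shown on this page.
    required = ["casatools", "casatasks", "casashell", "casampi"]
    missing = missing_modules(required)
    print("missing modules:", missing if missing else "none")
```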
Install Pipeline:
pip install .
To install Pipeline along with optional dependencies for developmental and experimental purposes in editable mode, try:
pip install -e ".[dev,docs]"
Our Read the Docs setup for Pipeline uses this approach for documentation builds (see .readthedocs.yaml).
Note
pyproject.toml and requirements.txt
environment.yml is no longer committed to the repository. Generate it on demand with pixi project export conda-environment as shown above (see Pixi and Running Pixi tasks). Its purpose is to define a self-contained Python environment with all CASA6 components and dependencies required by Pipeline.
pyproject.toml handles Pipeline packaging and build system requirements.
A separate requirements.txt handles Pipeline core/functional dependencies. The separation is intentional: it balances different needs and use cases, e.g., monolithic and modular CASA6 builds, developer/testing installation setups, etc.
Run Pipeline
Typical use patterns of Pipeline include running within a headless environment, or on a workstation interactively, either in CASA serial or parallel mode:
For an interactive use case, one could simply run this to start a casashell session:
conda activate pipeline
python -m casashell
For headless sessions to execute automated Pipeline data processing:
conda activate pipeline
xvfb-run -a python -m casashell --nologger --log2term --agg -c run_pipeline.py
Here run_pipeline.py is a Python script. Example content could be:
import pipeline.recipereducer

pipeline.recipereducer.reduce(
    vis=['../rawdata/uid___A002_Xc46ab2_X15ae_repSPW_spw16_17_small.ms'],
    procedure='procedure_hifa_calimage.xml',
    loglevel='debug',
)
or alternatively:
import pipeline
pipeline.initcli()
context = h_init()
context.set_state('ProjectStructure', 'recipe_name', 'hifa_calimage')
try:
    hifa_importdata(vis=['uid___A002_Xc46ab2_X15ae_repSPW_spw16_17_small.ms'], session=['default'], dbservice=True)
    hifa_flagdata()
    hifa_fluxcalflag()
    hif_rawflagchans()
    hif_refant()
    h_tsyscal()
    hifa_tsysflag()
    hifa_tsysflagcontamination()
    hifa_antpos()
    hifa_wvrgcalflag()
    hif_lowgainflag()
    hif_setmodels()
    hifa_bandpassflag()
    hifa_bandpass()
    hifa_spwphaseup()
    hifa_gfluxscaleflag()
    hifa_gfluxscale()
    hifa_timegaincal()
    hifa_renorm(createcaltable=True, atm_auto_exclude=True)
    hifa_targetflag()
    hif_applycal()
    hif_makeimlist(intent='PHASE,BANDPASS,AMPLITUDE')
    hif_makeimages()
    hif_makeimlist(intent='CHECK', per_eb=True)
    hif_makeimages()
    hifa_imageprecheck()
    hif_checkproductsize(maxcubesize=40.0, maxcubelimit=60.0, maxproductsize=500.0)
finally:
    h_save()
Below are examples of more finely controlled ways to run the Pipeline.
Serial
A plain Python session without invoking casashell:
PYTHONNOUSERSITE=1 OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 xvfb-run -a python ../scripts/run_pipeline.py
Here we exclude the user site-packages directory by setting the PYTHONNOUSERSITE environment variable to 1, which avoids potential package conflicts. We also set OMP_NUM_THREADS and OPENBLAS_NUM_THREADS to control the number of threads used by OpenMP/OpenBLAS-enabled libraries (e.g., numpy, scipy, casatools) during Pipeline processing.
A session via casashell, with CASA6 logging and plotting enabled:
PYTHONNOUSERSITE=1 OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 xvfb-run -a \
    python -m casashell --nologger --log2term --agg -c ../scripts/run_pipeline.py
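When a serial run is launched from another Python program instead of a shell, the same three environment variables can be set on a copy of the environment. This is a rough sketch under stated assumptions: serial_env and run_serial are hypothetical helper names, and the xvfb-run wrapper is deliberately left out.

```python
import os
import subprocess
import sys


def serial_env(threads, base_env=None):
    """Return a copy of the environment with the variables described above set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONNOUSERSITE"] = "1"               # skip the user site-packages directory
    env["OMP_NUM_THREADS"] = str(threads)       # cap OpenMP threads
    env["OPENBLAS_NUM_THREADS"] = str(threads)  # cap OpenBLAS threads
    return env


def run_serial(script, threads=4):
    """Run `script` with the current interpreter under the capped environment."""
    return subprocess.run([sys.executable, script], env=serial_env(threads), check=True)
```

Working on a copy keeps the calling process's own environment unchanged, so different runs can use different thread caps.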
Parallel
A plain Python session without invoking casashell:
PYTHONNOUSERSITE=1 OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 \
    mpirun --mca btl_vader_single_copy_mechanism none -x OMP_NUM_THREADS -x OPENBLAS_NUM_THREADS -x PRTE_MCA_quiet -np 4 \
    python -c "import casampi.private.start_mpi; exec(open('run_pipeline.py').read())"
casampi.private.start_mpi
As discussed in CAS-14037, casashell includes some configuration-dependent (modular vs. monolithic) environment initialization that helps casampi set up the client and server roles for the different openmpi processes while avoiding circular imports during casampi process initialization. Without casashell, you need to execute casampi.private.start_mpi outside the scope of casatasks, i.e., before casatasks is first imported (casashell implicitly imports casatasks in monolithic CASA distributions). As a workaround, include the following boilerplate at the beginning of your workflow script.
try:
    import casampi.private.start_mpi  # assign the client and server roles
    import casatasks                  # ensure the time-based logfile name
except (ImportError, RuntimeError):
    pass
Alternatively, as the example above shows, run these imports in a one-liner via the -c option of the python executable. If you run a parallel CASA session without going through casashell (e.g., mpirun -n 4 python run_script.py), place the code snippet above at the beginning of your Python script, before any casatasks import, to avoid deadlocks.
Otherwise, all openmpi processes will be initialized in the same way and will execute the content of your script concurrently, without the expected 1 x mpiclient + (nproc-1) x mpiserver configuration.
A session via casashell, with CASA6 logging and plotting enabled:
PYTHONNOUSERSITE=1 OMP_NUM_THREADS=1 xvfb-run -a \
    mpirun -display-allocation -display-map -oversubscribe --mca btl_vader_single_copy_mechanism none -x OMP_NUM_THREADS -n 4 \
    python -c "import casampi.private.start_mpi; import casashell.__main__" --nologger --log2term --agg -c ../scripts/run_pipeline.py
If you run a parallel CASA session with casashell, you need to add the code snippet to ~/.casa/config.py. Failure to do so will result in a deadlock the first time casatasks is imported. Note that we use python -c "import casampi.private.start_mpi; import casashell.__main__" instead of python -m casashell so that start_mpi runs before casatasks is imported. Otherwise, the first time casatasks is imported in an MPI server process, it will attempt to start a new MPI environment, leading to a deadlock.
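The ordering rule described above (import casampi.private.start_mpi before any casatasks import) can be roughly checked for a workflow script with a static scan of its import statements. This is only a sketch with a hypothetical function name: it orders imports by source position, not by execution order, and it cannot see casatasks imports that happen indirectly inside other modules.

```python
import ast


def start_mpi_imported_first(source):
    """Return True if casampi.private.start_mpi is imported before casatasks.

    Only import statements in `source` are inspected, ordered by position.
    A script that never imports casatasks trivially satisfies the rule.
    """
    imports = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for i, alias in enumerate(node.names):
                imports.append((node.lineno, node.col_offset, i, alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            for i, alias in enumerate(node.names):
                # Record both the package and the fully qualified name.
                imports.append((node.lineno, node.col_offset, i, node.module))
                imports.append((node.lineno, node.col_offset, i,
                                f"{node.module}.{alias.name}"))
    seen_start_mpi = False
    for *_, name in sorted(imports):
        if name == "casampi.private.start_mpi":
            seen_start_mpi = True
        elif name == "casatasks" or name.startswith("casatasks."):
            return seen_start_mpi
    return True
```

A typical use would be to read the script text with open(path).read() and warn before launching mpirun when the function returns False.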
Running a parallel CASA session from macOS
A parallel Pipeline data processing session might hang on macOS at the completion of the job due to lingering casaplotms.app sub-processes. This behavior appears to differ from Linux, potentially because each casaplotms process spawned from an MPI server process runs as a macOS "app". Although this doesn't affect the data processing, you may need to add the following snippet at the end of your Python job script to ensure a clean exit:
def close_plotms_on_mpiservers():
    try:
        from casampi.MPIEnvironment import MPIEnvironment
        from casampi.MPICommandClient import MPICommandClient
        client = MPICommandClient()
        mpi_server_list = MPIEnvironment.mpi_server_rank_list()
        client.push_command_request('from casaplotms import plotmstool', block=True, target_server=mpi_server_list)
        rs_list = client.push_command_request('plotmstool.__proc!=None', block=True, target_server=mpi_server_list)
        servers_with_active_plotms = [rs['server'] for rs in rs_list if rs['ret']]
        if servers_with_active_plotms:
            print(f'servers with active plotms instances: {servers_with_active_plotms}')
            client.push_command_request('plotmstool.__proc.kill()', block=True, target_server=servers_with_active_plotms)
    except Exception:
        pass

close_plotms_on_mpiservers()
In addition, xvfb-run is not available on macOS, even if xvfb/X11 is installed, so you may not be able to use it for headless sessions. To complete a Pipeline processing session that requires casaplotms, you must log in remotely with GUI access: the casaplotms GUI will appear in the desktop environment but cannot be forwarded via X11.
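Because xvfb-run is only usable on Linux, a launcher script may want to decide at run time whether to prepend it. The sketch below shows one possible way to do this; headless_prefix is a hypothetical helper, and treating xvfb-run as Linux-only follows the note above rather than any documented API.

```python
import platform
import shutil


def headless_prefix(system=None, xvfb_path=None):
    """Return the command prefix for a headless run, or [] if unavailable.

    `system` and `xvfb_path` default to values probed from the running host;
    pass an empty string as `xvfb_path` to simulate a missing xvfb-run.
    """
    system = platform.system() if system is None else system
    if xvfb_path is None:
        xvfb_path = shutil.which("xvfb-run")
    # Per the note above, xvfb-run is treated as Linux-only.
    if system == "Linux" and xvfb_path:
        return [xvfb_path, "-a"]
    return []
```

The returned list can be prepended to an argument list passed to subprocess.run, so the same launcher works with or without a virtual display.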
Useful shorthand
Useful aliases/shortcuts to emulate monolithic CASA executables:
conda activate pipeline
export casa_omp_num_threads=4
export casa_mpi_nproc=4
export TMPDIR=/tmp
export casa6_opts_custom='--nologger --log2term --agg'
export mpirun_custom='mpirun -display-allocation -display-map -oversubscribe --mca btl_vader_single_copy_mechanism none --mca btl ^openib -x OMP_NUM_THREADS -x PYTHONNOUSERSITE'
export xvfb_run_auto='xvfb-run -a' # Debian, Ubuntu, RedHat8, etc.
alias casa6='PYTHONNOUSERSITE=1 OMP_NUM_THREADS=${casa_omp_num_threads} python -m casashell'
alias casa6mpi='PYTHONNOUSERSITE=1 OMP_NUM_THREADS=1 ${mpirun_custom} -n ${casa_mpi_nproc} python -c "import casampi.private.start_mpi; import casashell.__main__"'
# For Linux only, not applicable on macOS
alias casa6_xvfb='PYTHONNOUSERSITE=1 OMP_NUM_THREADS=${casa_omp_num_threads} ${xvfb_run_auto} python -m casashell'
alias casa6mpi_xvfb='PYTHONNOUSERSITE=1 OMP_NUM_THREADS=1 ${xvfb_run_auto} ${mpirun_custom} -n ${casa_mpi_nproc} python -c "import casampi.private.start_mpi; import casashell.__main__"'
For executing a headless parallel Pipeline processing session on Linux, one could try:
casa6mpi_xvfb ${casa6_opts_custom} -c ../scripts/run_pltest.py
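If you prefer running with an 8-core mpicasa session (1 client + 7 servers), set casa_mpi_nproc before invoking the alias, because the alias expands ${casa_mpi_nproc} from the current shell and a same-line prefix assignment would not be seen by that expansion:
export casa_mpi_nproc=8
casa6mpi_xvfb ${casa6_opts_custom} -c ../scripts/run_pltest.py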
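For launchers written in Python, the casa6mpi alias above can be approximated by assembling the argument list explicitly, which avoids shell quoting and expansion-order pitfalls. This is an illustrative sketch only: casa6mpi_command is a hypothetical name, and the mpirun options are copied from the mpirun_custom definition above rather than validated independently.

```python
def casa6mpi_command(nproc, casashell_args=(), omp_threads=1):
    """Assemble argv and extra environment for a parallel casashell session.

    `nproc` MPI processes correspond to 1 client + (nproc - 1) servers.
    """
    env = {"PYTHONNOUSERSITE": "1", "OMP_NUM_THREADS": str(omp_threads)}
    argv = [
        "mpirun", "-display-allocation", "-display-map", "-oversubscribe",
        "--mca", "btl_vader_single_copy_mechanism", "none",
        "--mca", "btl", "^openib",
        "-x", "OMP_NUM_THREADS", "-x", "PYTHONNOUSERSITE",
        "-n", str(nproc),
        # Run start_mpi before casashell so that it precedes any casatasks import.
        "python", "-c", "import casampi.private.start_mpi; import casashell.__main__",
        *casashell_args,
    ]
    return argv, env
```

A caller could then run subprocess.run(argv, env={**os.environ, **env}, check=True), optionally prefixing argv with xvfb-run -a on Linux.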