dagster.readthedocs.io is currently stale due to availability issues.
New
Improvements to S3 Resource. (Thanks @dwallace0723!)
Better error messages in Dagit.
Better font/styling support in Dagit.
Changed OutputDefinition to take an is_required argument rather than is_optional. This keeps
it consistent with the changes to Field in 0.7.1 and avoids confusion
with python's typing and dagster's definition of Optional, which indicates None-ability,
rather than existence. is_optional is deprecated and will be removed in a future version.
Added support for Flower in dagster-k8s.
Added support for environment variable config in dagster-snowflake.
Bugfixes
Improved performance in Dagit waterfall view.
Fixed bug when executing solids downstream of a skipped solid.
Improved navigation experience for pipelines in Dagit.
Fixes for the dagster-aws CLI tool.
Fixed issue starting Dagit without DAGSTER_HOME set on Windows.
Fixed pipeline subset execution in partition-based schedules.
There are a substantial number of breaking changes in the 0.7.0 release.
Please see 070_MIGRATION.md for instructions regarding migrating old code.
Scheduler
The scheduler configuration has been moved from the @schedules decorator to DagsterInstance.
Existing schedules that have been running are no longer compatible with current storage. To
migrate, remove the scheduler argument from all @schedules decorators.
Finally, if you had any existing schedules running, delete the existing $DAGSTER_HOME/schedules
directory and run dagster schedule wipe && dagster schedule up to re-instantiate schedules in a
valid state.
The should_execute and environment_dict_fn arguments to ScheduleDefinition now take a
required first argument context, representing the ScheduleExecutionContext.
Config System Changes
In the config system, Dict has been renamed to Shape; List to Array; Optional to
Noneable; and PermissiveDict to Permissive. The motivation here is to clearly delineate
config use cases versus cases where you are using types as the inputs and outputs of solids as
well as python typing types (for mypy and friends). We believe this will be clearer to users in
addition to simplifying our own implementation and internal abstractions.
Our recommended fix is not to use Shape and Array, but instead to use our new condensed
config specification API. This allows one to use bare dictionaries instead of Shape, lists with
one member instead of Array, bare types instead of Field with a single argument, and python
primitive types (int, bool etc) instead of the dagster equivalents. These result in
dramatically less verbose config specs in most cases.
So instead of
from dagster import Shape, Field, Int, Array, String
# ... code
config=Shape({ # Dict prior to change
    'some_int': Field(Int),
    'some_list': Field(Array[String]) # List prior to change
})
one can instead write:
config={'some_int': int, 'some_list': [str]}
No imports and much simpler, cleaner syntax.
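To make the correspondence between the condensed form and the explicit form concrete, here is a toy sketch that expands a shorthand spec into a nested description. It is not Dagster's implementation: the function name expand_shorthand and the dict-based output format are hypothetical and exist only for illustration.

```python
# Toy illustration only: this is NOT how Dagster resolves config internally.
# `expand_shorthand` and its output format are hypothetical.

_PRIMITIVE_NAMES = {int: "Int", str: "String", bool: "Bool", float: "Float"}


def expand_shorthand(spec):
    """Expand a condensed config spec into an explicit, nested description."""
    if isinstance(spec, dict):
        # A bare dictionary plays the role of Shape.
        return {
            "kind": "Shape",
            "fields": {key: expand_shorthand(value) for key, value in spec.items()},
        }
    if isinstance(spec, list):
        # A list with exactly one member plays the role of Array.
        if len(spec) != 1:
            raise ValueError("a list shorthand must have exactly one member")
        return {"kind": "Array", "of": expand_shorthand(spec[0])}
    if isinstance(spec, type) and spec in _PRIMITIVE_NAMES:
        # Python primitive types stand in for the dagster equivalents.
        return {"kind": _PRIMITIVE_NAMES[spec]}
    raise TypeError("unsupported shorthand: {!r}".format(spec))


expanded = expand_shorthand({'some_int': int, 'some_list': [str]})
```

Here expanded describes a Shape with an Int field and an Array-of-String field, mirroring the longer Shape/Field/Array spelling above.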
config_field is no longer a valid argument on solid, SolidDefinition, ExecutorDefinition,
executor, LoggerDefinition, logger, ResourceDefinition, resource, system_storage, and
SystemStorageDefinition. Use config instead.
For composite solids, the config_fn no longer takes a ConfigMappingContext, and the context
has been deleted. To upgrade, remove the first argument to config_fn.
Field takes an is_required rather than an is_optional argument. This is to avoid confusion
with python's typing and dagster's definition of Optional, which indicates None-ability,
rather than existence. is_optional is deprecated and will be removed in a future version.
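The difference between existence (is_required) and None-ability (Noneable) can be easy to blur, so here is a small plain-Python sketch of the two checks. The helper check_field and its noneable parameter are hypothetical and are not part of Dagster; they only model the distinction described above.

```python
# Hypothetical helper for illustration; not a Dagster API.
_MISSING = object()


def check_field(config, key, is_required=True, noneable=False):
    """Return True if `config` satisfies the rules for `key`.

    is_required governs whether the key must exist at all.
    noneable governs whether an existing key may hold None.
    """
    value = config.get(key, _MISSING)
    if value is _MISSING:
        # Existence: a missing key is only acceptable when not required.
        return not is_required
    if value is None:
        # None-ability: a present-but-None value needs noneable=True.
        return noneable
    return True
```

A field can therefore be not required yet reject None when present, or required yet accept None; the two settings are independent.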
Required Resources
All solids, types, and config functions that use a resource must explicitly list that
resource using the argument required_resource_keys. This is to enable efficient
resource management during pipeline execution, especially in a multiprocessing or
remote execution environment.
The @system_storage decorator now requires the argument required_resource_keys, which was
previously optional.
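One way to picture why explicit resource keys help is that an executor can work out which resources a given set of steps needs before starting any of them. The sketch below is a toy model under that assumption; resources_to_initialize and the example step and resource names are hypothetical, and this is not Dagster's actual resource initialization logic.

```python
# Toy model only; names are hypothetical and this is not Dagster's implementation.

def resources_to_initialize(step_requirements, steps_to_run):
    """Union the declared resource keys of just the steps that will run."""
    needed = set()
    for step in steps_to_run:
        needed |= set(step_requirements.get(step, ()))
    return needed


# Each step declares the resource keys it uses, as required_resource_keys does.
requirements = {
    "load": {"s3"},
    "transform": {"spark"},
    "notify": {"slack"},
}

# A process that only runs "load" and "transform" never needs "slack".
needed = resources_to_initialize(requirements, ["load", "transform"])
```

In a multiprocessing or remote setting, each worker could apply the same idea to the subset of steps it executes.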
Dagster Type System Changes
dagster.Set and dagster.Tuple can no longer be used within the config system.
Dagster types are now instances of DagsterType, rather than a class that inherits from
RuntimeType. Instead of dynamically generating a class to create a custom runtime type, just
create an instance of a DagsterType. The type checking function is now an argument to the
DagsterType, rather than an abstract method that has to be implemented in
a subclass.
RuntimeType has been renamed to DagsterType and is now an encouraged API for type creation.
Core type check function of DagsterType can now return a naked bool in addition
to a TypeCheck object.
type_check_fn on DagsterType (formerly type_check and RuntimeType, respectively) now
takes a first argument context of type TypeCheckContext in addition to the second argument of
value.
define_python_dagster_type has been eliminated in favor of PythonObjectDagsterType.
dagster_type has been renamed to usable_as_dagster_type.
as_dagster_type has been removed and similar capabilities added as
make_python_type_usable_as_dagster_type.
PythonObjectDagsterType and usable_as_dagster_type no longer take a type_check argument. If
a custom type_check is needed, use DagsterType.
As a consequence of these changes, if you were previously using dagster_pyspark or
dagster_pandas and expecting Pyspark or Pandas types to work as Dagster types, e.g., in type
annotations to functions decorated with @solid to indicate that they are input or output types
for a solid, you will need to call make_python_type_usable_as_dagster_type from your code in
order to map the Python types to the Dagster types, or just use the Dagster types
(dagster_pandas.DataFrame instead of pandas.DataFrame) directly.
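As a rough sketch of the new type check shape described above, the function below takes a context as its first argument and the value as its second, and returns a plain bool. The function name positive_int_check is hypothetical, and the wiring shown in the trailing comment is an assumption about how it would be passed to DagsterType rather than verified usage.

```python
# A type check function in the new style: (context, value) -> bool.
# The name `positive_int_check` is illustrative only.

def positive_int_check(_context, value):
    """Accept only positive integers.

    bool is a subclass of int in Python, so it is excluded explicitly.
    """
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


# Assumed wiring with Dagster 0.7.x installed (not executed here):
#   PositiveInt = DagsterType(name='PositiveInt', type_check_fn=positive_int_check)
```

Because the check is an ordinary function, it can be exercised directly in unit tests by passing None for the unused context.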
Other
We no longer publish base Docker images. Please see the updated deployment docs for an example
Dockerfile off of which you can work.
step_metadata_fn has been removed from SolidDefinition & @solid.
SolidDefinition & @solid now take tags and enforce that values are strings or
are safely encoded as JSON. metadata is deprecated and will be removed in a future version.
resource_mapper_fn has been removed from SolidInvocation.
New
Dagit now includes a much richer execution view, with a Gantt-style visualization of step
execution and a live timeline.
Early support for Python 3.8 is now available, and Dagster/Dagit along with many of our libraries
are now tested against 3.8. Note that several of our upstream dependencies have yet to publish
wheels for 3.8 on all platforms, so running on Python 3.8 likely still involves building some
dependencies from source.
dagster/priority tags can now be used to prioritize the order of execution for the built-in
in-process and multiprocess engines.
dagster-postgres storages can now be configured with separate arguments and environment
variables, such as:
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: test
      password:
        env: ENV_VAR_FOR_PG_PASSWORD
      hostname: localhost
      db_name: test
Support for RunLaunchers on DagsterInstance allows for execution to be "launched" outside of
the Dagit/Dagster process. As one example, this is used by dagster-k8s to submit pipeline
execution as a Kubernetes Job.
Added support for adding tags to runs initiated from the Playground view in Dagit.
Added @monthly_schedule decorator.
Added Enum.from_python_enum helper to wrap Python enums for config. (Thanks @kdungs!)
[dagster-bash] The Dagster bash solid factory now passes along kwargs to the underlying
solid construction, and now has a single Nothing input by default to make it easier to create a
sequencing dependency. Also, logs are now buffered by default to make execution less noisy.
[dagster-aws] We've improved our EMR support substantially in this release. The
dagster_aws.emr library now provides an EmrJobRunner with various utilities for creating EMR
clusters, submitting jobs, and waiting for jobs/logs. We also now provide a
emr_pyspark_resource, which together with the new @pyspark_solid decorator makes moving
pyspark execution from your laptop to EMR as simple as changing modes.
[dagster-pandas] Added the create_dagster_pandas_dataframe_type, PandasColumn, and
Constraint APIs so that users can create custom types which perform column validation,
dataframe validation, summary statistics emission, and dataframe serialization/deserialization.
[dagster-gcp] GCS is now supported for system storage, as well as being supported with the
Dask executor. (Thanks @habibutsu!) BigQuery solids have also been updated to support the new API.
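The dagster/priority tag mentioned in the list above can be pictured with a small ordering sketch. It assumes that larger values run first, that untagged steps default to a priority of 0, and that tag values arrive as strings; order_by_priority is a hypothetical helper, not the engines' real scheduling code.

```python
# Illustrative sketch; assumes higher priority runs first and a default of 0.
# `order_by_priority` is hypothetical and not part of Dagster.

def order_by_priority(steps):
    """Order (name, tags) pairs by their dagster/priority tag, highest first.

    Python's sort is stable, so steps with equal priority keep their
    original relative order.
    """
    def priority(step):
        _name, tags = step
        # Tag values are strings, so convert before comparing.
        return int(tags.get("dagster/priority", "0"))

    return [name for name, _tags in sorted(steps, key=priority, reverse=True)]


ready_steps = [
    ("a", {}),
    ("b", {"dagster/priority": "5"}),
    ("c", {"dagster/priority": "-1"}),
    ("d", {}),
]
ordered = order_by_priority(ready_steps)
```

In this model, a negative priority pushes a step behind all untagged steps.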
Bugfix
Ensured that all implementations of RunStorage clean up pipeline run tags when a run
is deleted. Requires a storage migration, using dagster instance migrate.
The multiprocess and Celery engines now handle solid subsets correctly.
The multiprocess and Celery engines will now correctly emit skip events for steps downstream of
failures and other skips.
The @solid and @lambda_solid decorators now correctly wrap their decorated functions, in the
sense of functools.wraps.
Performance improvements in Dagit when working with runs with large configurations.
The Helm chart in dagster_k8s has been hardened against various failure modes and is now
compatible with Helm 2.
SQLite run and event log storages are more robust to concurrent use.
Improvements to error messages and to handling of user code errors in input hydration and output
materialization logic.
Fixed an issue where the Airflow scheduler could hang when attempting to load dagster-airflow
pipelines.
We now handle our SQLAlchemy connections in a more canonical way (thanks @zzztimbo!).
Fixed an issue using S3 system storage with certain custom serialization strategies.
Fixed an issue leaking orphan processes from compute logging.
Fixed an issue leaking semaphores from Dagit.
Setting the raise_error flag in execute_pipeline now actually raises user exceptions instead
of a wrapper type.
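For readers unfamiliar with what wrapping "in the sense of functools.wraps" means, the plain-Python sketch below contrasts a decorator that loses the decorated function's metadata with one that preserves it. The decorators here are generic stand-ins written for illustration; they are not the @solid or @lambda_solid implementations.

```python
import functools


def naive_decorator(fn):
    """Wrap fn without copying its metadata."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def wrapping_decorator(fn):
    """Wrap fn and copy __name__, __doc__, etc. via functools.wraps."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


@naive_decorator
def add_one_naive(x):
    """Add one to x."""
    return x + 1


@wrapping_decorator
def add_one(x):
    """Add one to x."""
    return x + 1
```

With functools.wraps, add_one keeps its own name and docstring and exposes the original function as add_one.__wrapped__, whereas add_one_naive reports the name of the inner wrapper.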
Documentation
Our docs have been reorganized and expanded (thanks @habibutsu, @vatervonacht, @zzztimbo). We'd
love feedback and contributions!
Thank you
Thank you to all of the community contributors to this release!! In alphabetical order: @habibutsu,
@kdungs, @vatervonacht, @zzztimbo.
Added the dagster-github library, a community contribution from @Ramshackle-Jamathon and
@k-mahoney!
dagster-celery
Simplified and improved config handling.
An engine event is now emitted when the engine fails to connect to a broker.
Bugfix
Fixes a file descriptor leak when running many concurrent dagster-graphql queries (e.g., for
backfill).
The @pyspark_solid decorator now handles inputs correctly.
The handling of solid compute functions that accept kwargs but which are decorated with explicit
input definitions has been rationalized.
Fixed race conditions in concurrent execution using SQLite event log storage, uncovered by
upstream improvements in the Python inotify library we use.
Documentation
Improved error messages when using system storages that don't fulfill executor requirements.