
"""
Overview
--------

This module implements the Multiple Imputation through Chained
Equations (MICE) approach to handling missing data in statistical data
analyses. The approach has the following steps:

0. Impute each missing value with the observed value of the same
variable that is closest to the mean of its observed values.

1. For each variable in the data set with missing values (termed the
'focus variable'), do the following:

1a. Fit an 'imputation model', which is a regression model for the
focus variable, regressed on the observed and (current) imputed values
of some or all of the other variables.

1b. Impute the missing values for the focus variable.  Currently this
imputation must use the 'predictive mean matching' (pmm) procedure.

2. Once all variables have been imputed, fit the 'analysis model' to
the data set.

3. Repeat steps 1-2 multiple times and combine the results using a
'combining rule' to produce point estimates of all parameters in the
analysis model and standard errors for them.
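
The combining rule in step 3 is not spelled out here.  A minimal sketch,
assuming the standard Rubin's rules for a single scalar parameter: the
pooled estimate is the mean of the per-imputation estimates, and its
variance is the mean within-imputation variance plus (1 + 1/m) times
the between-imputation variance.  The name `combine_rubin` and its
inputs are hypothetical and are not part of this module.

```python
from math import sqrt
from statistics import mean, variance


def combine_rubin(estimates, variances):
    # estimates: one point estimate per imputed data set
    # variances: the squared standard error from each fit
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    # Sample variance of the estimates across imputations.
    between = variance(estimates) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


est, se = combine_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The between-imputation term is what makes the pooled standard error
larger than the standard error from any single imputed data set.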

The imputations for each variable are based on an imputation model
that is specified via a model class and a formula for the regression
relationship.  The default model is OLS, with a formula specifying
main effects for all other variables.

The MICE procedure can be used in one of two ways:

* If the goal is only to produce imputed data sets, the MICEData class
can be used to wrap a data frame, providing facilities for doing the
imputation.  Summary plots are available for assessing the performance
of the imputation.

* If the imputed data sets are to be used to fit an additional
'analysis model', a MICE instance can be used.  After specifying the
MICE instance and running it, the results are combined using the
`combine` method.  Results and various summary plots are then
available.

Terminology
-----------

The primary goal of the analysis is usually to fit and perform
inference using an 'analysis model'. If an analysis model is not
specified, then imputed datasets are produced for later use.

The MICE procedure involves a family of imputation models.  There is
one imputation model for each variable with missing values.  An
imputation model may be conditioned on all or a subset of the
remaining variables, using main effects, transformations,
interactions, etc. as desired.

A 'perturbation method' is a method for setting the parameter estimate
in an imputation model.  The 'gaussian' perturbation method first fits
the model (usually using maximum likelihood, but it could use any
statsmodels fit procedure), then sets the parameter vector equal to a
draw from the Gaussian approximation to the sampling distribution for
the fit.  The 'boot' (bootstrap) perturbation method sets the
parameter vector equal to a fitted parameter vector obtained when
fitting the conditional model to a bootstrapped version of the data
set.
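
The two perturbation methods can be sketched for a deliberately simple
one-parameter model, the sample mean.  This is an illustration under
that assumption, not the code used by this module; the function names
`perturb_gaussian` and `perturb_boot` are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def perturb_gaussian(y, rng):
    # Fit the model (here just the sample mean), then draw the parameter
    # from a normal approximation to its sampling distribution.
    est = mean(y)
    se = stdev(y) / sqrt(len(y))
    return rng.gauss(est, se)


def perturb_boot(y, rng):
    # Refit the same model on a bootstrap resample of the observed data.
    resample = [rng.choice(y) for _ in y]
    return mean(resample)


rng = random.Random(0)
y_obs = [2.1, 1.9, 2.4, 2.0, 2.6, 1.8]
theta_gaussian = perturb_gaussian(y_obs, rng)
theta_boot = perturb_boot(y_obs, rng)
```

Both functions return a parameter value that differs from the plain
fitted value, which is how uncertainty about the imputation model is
carried into the imputed data.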

Class structure
---------------

There are two main classes in the module:

* 'MICEData' wraps a Pandas dataframe, incorporating information about
  the imputation model for each variable with missing values. It can
  be used to produce multiply imputed data sets that are to be further
  processed or distributed to other researchers.  A number of plotting
  procedures are provided to visualize the imputation results and
  missing data patterns.  The `history_callback` hook allows any
  features of interest of the imputed data sets to be saved for
  further analysis.

* 'MICE' takes both a 'MICEData' object and an analysis model
  specification.  It runs the multiple imputation, fits the analysis
  models, and combines the results to produce a `MICEResults` object.
  The summary method of this results object can be used to see the key
  estimands and inferential quantities.

Notes
-----

By default, to conserve memory 'MICEData' saves very little
information from one iteration to the next.  The data set passed by
the user is copied on entry, but then is over-written each time new
imputations are produced.  If using 'MICE', the fitted
analysis models and results are saved.  MICEData includes a
`history_callback` hook that allows arbitrary information from the
intermediate datasets to be saved for future use.

References
----------

JL Schafer: 'Multiple Imputation: A Primer', Stat Methods Med Res,
1999.

TE Raghunathan et al.: 'A Multivariate Technique for Multiply
Imputing Missing Values Using a Sequence of Regression Models', Survey
Methodology, 2001.

SAS Institute: 'Predictive Mean Matching Method for Monotone Missing
Data', SAS 9.2 User's Guide, 2014.

A Gelman et al.: 'Multiple Imputation with Diagnostics (mi) in R:
Opening Windows into the Black Box', Journal of Statistical Software,
2009.
    N)LikelihoodModelResults)OLS)defaultdictz
    >>> imp = mice.MICEData(data)
    >>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')
    >>> for j in range(20):
    ...     imp.update_all()
    ...     imp.data.to_csv('data%02d.csv' % j)"""


class PatsyFormula:
    """
    A simple wrapper for a string to be interpreted as a Patsy formula.
    """
    def __init__(self, formula):
        self.formula = "0 + " + formula


class MICEData:

    __doc__ = """\
    Wrap a data set to allow missing data handling with MICE.

    Parameters
    ----------
    data : Pandas data frame
        The data set, which is copied internally.
    perturbation_method : str
        The default perturbation method, either 'gaussian' or 'boot'.
    k_pmm : int
        The number of nearest neighbors to use during predictive mean
        matching.  Can also be specified in `set_imputer`.
    history_callback : function
        A function that is called after each complete imputation
        cycle.  The return value is appended to `history`.  The
        MICEData object is passed as the sole argument to
        `history_callback`.

    Notes
    -----
    Allowed perturbation methods are 'gaussian' (the model parameters
    are set to a draw from the Gaussian approximation to the posterior
    distribution), and 'boot' (the model parameters are set to the
    estimated values obtained when fitting a bootstrapped version of
    the data set).

    `history_callback` can be implemented to have side effects such as
    saving the current imputed data set to disk.

    Examples
    --------
    Draw 20 imputations from a data set called `data` and save them in
    separate files with filename pattern `dataXX.csv`.  The variables
    other than `x1` are imputed using linear models fit with OLS, with
    mean structures containing main effects of all other variables in
    `data`.  The variable named `x1` has a conditional mean structure
    that includes an additional term for x2^2.
    {_mice_data_example_1}
    )_mice_data_example_1gaussian   Nc                     |j         j        t          j        d          k    rd}t          |          t	                       _        |                    d                              d           _        | _	        g  _
        i  _        t          fd           _        i  _        i  _         j        j         D ]9}                      j        |                   \  }}| j        |<   | j        |<   :i  _        i  _        i  _        t          t                     _        t          t                     _        i  _        i  _        |j         D ]}	                     |	           t3          |j                    fdD             }
t          j        |
          }
t          j        |
          }|t9          |
d	k              d          }fd
|D              _                                          | _        d S )NOz0MICEData data column names should be string typeall)howT)dropc                       S Nr   )perturbation_methods   r   <lambda>z#MICEData.__init__.<locals>.<lambda>   s    /B r   c                 D    g | ]}t          j        |                   S r   )lenix_miss).0vr
   s     r   
<listcomp>z%MICEData.__init__.<locals>.<listcomp>   s'    666!T\!_%%666r   r   c                      g | ]
}|         S r   r   )r$   ivnamess     r   r&   z%MICEData.__init__.<locals>.<listcomp>   s    3331VAY333r   ) columnsdtypenp
ValueErrordictregularizeddropnareset_indexdatahistory_callbackhistorypredict_kwdsr   r   ix_obsr#   _split_indicesmodelsresultsconditional_formula	init_kwdsfit_kwdsmodel_classparamsset_imputerlistasarrayargsortsum_cycle_order_initial_imputationk_pmm)r
   r2   r   rF   r3   msgcolr6   r#   vnamenmissiir)   s   ` `         @r   r   zMICEData.__init__   s    <#..DCS//!66 KKEK**66D6AA	 0 $/ 0C 0C 0C 0C $D $D 
 9$ 	( 	(C"11$)C.AAOFG%DK 'DL  $&  %T**#D))   \ 	$ 	$EU#### dl##6666v666
5!!ZEQJ  !3333333  """


r   c                 :    |                      d           | j        S )a  
        Returns the next imputed dataset in the imputation process.

        Returns
        -------
        data : array_like
            An imputed dataset from the MICE chain.

        Notes
        -----
        `MICEData` does not have a `skip` parameter.  Consecutive
        values returned by `next_sample` are immediately consecutive
        in the imputation chain.

        The returned value is a reference to the data attribute of
        the class and should be copied before making any changes.
           )
update_allr2   )r
   s    r   next_samplezMICEData.next_sample   s    & 	yr   c                 @   i }| j         j        D ]r}| j         |         | j         |                                         z
  }t          j        |          }|                                }| j         |         j        |         ||<   s| j                             |d           dS )z
        Use a PMM-like procedure for initial imputed values.

        For each variable, missing values are imputed as the observed
        value that is closest to the mean over all observed values.
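
A minimal pure-Python sketch of this rule for a single variable, using
`None` to mark missing entries.  The helper name `initial_fill` is
hypothetical; the method itself works on the columns of a data frame.

```python
def initial_fill(values):
    # `None` marks a missing entry in this sketch.
    observed = [v for v in values if v is not None]
    center = sum(observed) / len(observed)
    # Observed value nearest to the observed mean; ties go to the first one.
    fill = min(observed, key=lambda v: abs(v - center))
    return [fill if v is None else v for v in values]


filled = initial_fill([1.0, None, 4.0, 10.0, None])
```

Here the observed mean is 5.0, so both missing entries are filled with
4.0, the nearest observed value.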
        T)inplaceN)r2   r*   meanr,   absidxminlocfillna)r
   
imp_valuesrH   diixs        r   rE   zMICEData._initial_imputation  s     
9$ 	5 	5C3$)C."5"5"7"77BBB"in04JsOO	T22222r   c                     t          j        |          }t          j        |           }t          j        |          }t	          |          dk    rt          d          ||fS )Nr   z-variable to be imputed has no observed values)pdisnullr,   flatnonzeror"   r-   )r
   vecnullr6   r#   s        r   r7   zMICEData._split_indices!  sY    y~~&&.&&v;;!LMMMwr   Fc
                 f   |>fd| j         j        D             }
dz   d                    |
          z   }|| j        <   ndz   |z   }|| j        <   |t          | j        <   n
|| j        <   |
|| j        <   |
|| j        <   |
|| j        <   |
|| j	        <   || _
        |	| j        <   dS )a  
        Specify the imputation process for a single variable.

        Parameters
        ----------
        endog_name : str
            Name of the variable to be imputed.
        formula : str
            Conditional formula for imputation. Defaults to a formula
            with main effects for all other variables in dataset.  The
            formula should only include an expression for the mean
            structure, e.g. use 'x1 + x2' not 'x4 ~ x1 + x2'.
        model_class : statsmodels model
            Conditional model for imputation. Defaults to OLS.  See below
            for more information.
        init_kwds : dict-like
            Keyword arguments passed to the model init method.
        fit_kwds : dict-like
            Keyword arguments passed to the model fit method.
        predict_kwds : dict-like
            Keyword arguments passed to the model predict method.
        k_pmm : int
            Determines number of neighboring observations from which
            to randomly sample when using predictive mean matching.
        perturbation_method : str
            Either 'gaussian' or 'boot'. Determines the method
            for perturbing parameters in the imputation model.  If
            None, uses the default specified at class initialization.
        regularized : bool
            If True, `fit_regularized` rather than `fit` is called
            when fitting imputation models for this variable.  When
            `regularized` is True for any variable,
            `perturbation_method` must be set to 'boot'.

        Notes
        -----
        The model class must meet the following conditions:
            * A model must have a 'fit' method that returns an object.
            * The object returned from `fit` must have a `params` attribute
              that is an array-like object.
            * The object returned from `fit` must have a cov_params method
              that returns a square array-like object.
            * The model must have a `predict` method.
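
The role of `k_pmm` can be illustrated with a generic sketch of
predictive mean matching: for each missing case, the `k` observed cases
with the closest predicted means are found, and the observed value of
one of them, chosen at random, is used as the imputed value.  This is a
textbook version written for illustration, not this module's
implementation; `pmm_impute` and its arguments are hypothetical.

```python
import random


def pmm_impute(pred_obs, y_obs, pred_miss, k, rng):
    # pred_obs / pred_miss: predicted means for observed and missing cases.
    imputed = []
    for p in pred_miss:
        # Indices of the k observed cases with the closest predicted mean.
        order = sorted(range(len(pred_obs)),
                       key=lambda i: abs(pred_obs[i] - p))
        donors = order[:k]
        # Borrow the observed value of one randomly chosen donor.
        imputed.append(y_obs[rng.choice(donors)])
    return imputed


rng = random.Random(1)
values = pmm_impute([1.0, 2.0, 3.0, 4.0], [10, 20, 30, 40],
                    [2.1, 3.9], 2, rng)
```

Because every imputed value is copied from an observed case, the
imputations always lie within the range of the observed data.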
        """

        if formula is None:
            main_effects = [x for x in self.data.columns
                            if x != endog_name]
            fml = endog_name + " ~ " + " + ".join(main_effects)
            self.conditional_formula[endog_name] = fml
        else:
            fml = endog_name + " ~ " + formula
            self.conditional_formula[endog_name] = fml

        if model_class is None:
            self.model_class[endog_name] = OLS
        else:
            self.model_class[endog_name] = model_class

        if init_kwds is not None:
            self.init_kwds[endog_name] = init_kwds

        if fit_kwds is not None:
            self.fit_kwds[endog_name] = fit_kwds

        if predict_kwds is not None:
            self.predict_kwds[endog_name] = predict_kwds

        if perturbation_method is not None:
            self.perturbation_method[endog_name] = perturbation_method

        self.k_pmm = k_pmm
        self.regularized[endog_name] = regularized

    def _store_changes(self, col, vals):
        """
        Fill in dataset with imputed values.

        Parameters
        ----------
        col : str
            Name of variable to be filled in.
        vals : ndarray
            Array of imputed values to use for filling-in missing values.
        """

        ix = self.ix_miss[col]
        if len(ix) > 0:
            self.data.iloc[ix, self.data.columns.get_loc(col)] = (
                np.atleast_1d(vals))

    def update_all(self, n_iter=1):
        """
        Perform a specified number of MICE iterations.

        Parameters
        ----------
        n_iter : int
            The number of updates to perform.  Only the result of the
            final update will be available.

        Notes
        -----
        The imputed values are stored in the instance attribute `self.data`.
        N)rangerD   updater3   r4   append)r
   n_iterkrI   hvs        r   rN   zMICEData.update_all  s     v 	# 	#A* # #E""""#  ,&&t,,BL##### -,r   c                 *   | j         |         }t          j        || j        d          \  }}| j        |         }t          j        |j        |         d          }t          j        |j        |ddf         d          }| j        |         }t          j        |j        |ddf         d          }	i }
|| j	        v r#| j	        |         }| 
                    ||          }
i }|| j	        v r#| j	        |         }| 
                    ||          }|||	|
|fS )aJ  
        Return endog and exog for imputation of a given variable.

        Parameters
        ----------
        vname : str
           The variable for which the split data is returned.

        Returns
        -------
        endog_obs : DataFrame
            Observed values of the variable to be imputed.
        exog_obs : DataFrame
            Current values of the predictors where the variable to be
            imputed is observed.
        exog_miss : DataFrame
            Current values of the predictors where the variable to be
            imputed is missing.
        predict_obs_kwds : dict-like
            The predict keyword arguments for `vname`, processed through
            Patsy as required, for the cases where `vname` is observed.
        predict_miss_kwds : dict-like
            The predict keyword arguments for `vname`, processed through
            Patsy as required, for the cases where `vname` is missing.
        	dataframereturn_typeWrequirementsN)r:   patsy	dmatricesr2   r6   r,   requireri   r#   r5   _process_kwds)r
   rI   r	   endogexogixo	endog_obsexog_obsixm	exog_misspredict_obs_kwdskwdspredict_miss_kwdss                r   get_split_datazMICEData.get_split_data  s5   6 *51ogty2=? ? ?t k% Juz#SAAA	:diQQQ/cBBB l5!Jtyaaa0sCCC	D%%%$U+D#11$<<D%%%$U+D $ 2 24 = =8Y0@!# 	#r   c                 8   |                                 }|D ]}||         }t          |t                    rct          j        |j        | j        d          }t          j        |d          |d d f         }|j	        d         dk    r|d d df         }|||<   |S )Nru   rv   rx   ry   rM   r   )
copy
isinstancer   r{   dmatrixr	   r2   r,   r}   shape)r
   r   rY   rr   r%   mats         r   r~   zMICEData._process_kwds  s    yy{{ 	 	AQA!\** mAIty0;= = =j3777AAA>9Q<1$$aaad)CQr   c                    | j         |         }| j        |         }t          j        || j        d          \  }}t          j        |j        |df         d          }t          j        |j        |ddf         d          }|                     | j	        |         |          }|                     | j
        |         |          }||||fS )a  
        Return the data needed to fit a model for imputation.

        The data is used to impute variable `vname`, and therefore
        only includes cases for which `vname` is observed.

        Values of type `PatsyFormula` in `init_kwds` or `fit_kwds` are
        processed through Patsy and subset to align with the model's
        endog and exog.

        Parameters
        ----------
        vname : str
           The variable for which the fitting data is returned.

        Returns
        -------
        endog : DataFrame
            Observed values of `vname`.
        exog : DataFrame
            Regression design matrix for imputing `vname`.
        init_kwds : dict-like
            The init keyword arguments for `vname`, processed through Patsy
            as required.
        fit_kwds : dict-like
            The fit keyword arguments for `vname`, processed through Patsy
            as required.
        """

        # Rows where `vname` is observed
        ix = self.ix_obs[vname]

        formula = self.conditional_formula[vname]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")

        endog = np.require(endog.iloc[ix, 0], requirements="W")
        exog = np.require(exog.iloc[ix, :], requirements="W")

        init_kwds = self._process_kwds(self.init_kwds[vname], ix)
        fit_kwds = self._process_kwds(self.fit_kwds[vname], ix)

        return endog, exog, init_kwds, fit_kwds

    def plot_missing_pattern(self, ax=None, row_order="pattern",
                             column_order="pattern",
                             hide_complete_rows=False,
                             hide_complete_columns=False,
                             color_row_patterns=True):
        """
        Generate an image showing the missing data pattern.

        Parameters
        ----------
        ax : AxesSubplot
            Axes on which to draw the plot.
        row_order : str
            The method for ordering the rows.  Must be one of 'pattern',
            'proportion', or 'raw'.
        column_order : str
            The method for ordering the columns.  Must be one of 'pattern',
            'proportion', or 'raw'.
        hide_complete_rows : bool
            If True, rows with no missing values are not drawn.
        hide_complete_columns : bool
            If True, columns with no missing values are not drawn.
        color_row_patterns : bool
            If True, color the unique row patterns, otherwise use grey
            and white as colors.

        Returns
        -------
        A figure containing a plot of the missing data pattern.
        """

        # Indicator matrix for missing values
        miss = np.zeros(self.data.shape)
        cols = self.data.columns
        for j, col in enumerate(cols):
            ix = self.ix_miss[col]
            miss[ix, j] = 1

        # Order the columns as requested
        if column_order == "proportion":
            ix = np.argsort(miss.mean(0))
        elif column_order == "pattern":
            cv = np.cov(miss.T)
            u, s, vt = np.linalg.svd(cv, 0)
            ix = np.argsort(cv[:, 0])
        elif column_order == "raw":
            ix = np.arange(len(cols))
        else:
            raise ValueError(
                column_order + " is not an allowed value for `column_order`.")
        miss = miss[:, ix]
        cols = [cols[i] for i in ix]

        # Order the rows as requested
        if row_order == "proportion":
            ix = np.argsort(miss.mean(1))
        elif row_order == "pattern":
            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            ix = np.argsort(rky)
        elif row_order == "raw":
            ix = np.arange(miss.shape[0])
        else:
            raise ValueError(
                row_order + " is not an allowed value for `row_order`.")
        miss = miss[ix, :]

        if hide_complete_rows:
            ix = np.flatnonzero((miss == 1).any(1))
            miss = miss[ix, :]

        if hide_complete_columns:
            ix = np.flatnonzero((miss == 1).any(0))
            miss = miss[:, ix]
            cols = [cols[i] for i in ix]

        from statsmodels.graphics import utils as gutils
        from matplotlib.colors import LinearSegmentedColormap

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        if color_row_patterns:
            # Give each distinct row pattern its own color
            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            _, rcol = np.unique(rky, return_inverse=True)
            miss *= 1 + rcol[:, None]
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap='gist_ncar_r')
        else:
            cmap = LinearSegmentedColormap.from_list("_",
                                                     ["white", "darkgrey"])
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap=cmap)

        ax.set_ylabel("Cases")
        ax.set_xticks(range(len(cols)))
        ax.set_xticklabels(cols, rotation=90)

        return fig

    def plot_bivariate(self, col1_name, col2_name,
                       lowess_args=None, lowess_min_n=40,
                       jitter=None, plot_points=True, ax=None):
        """
        Plot observed and imputed values for two variables.

        Displays a scatterplot of one variable against another.  The
        points are colored according to whether the values are
        observed or imputed.

        Parameters
        ----------
        col1_name : str
            The variable to be plotted on the horizontal axis.
        col2_name : str
            The variable to be plotted on the vertical axis.
        lowess_args : dictionary
            A dictionary of dictionaries, keys are 'ii', 'io', 'oi'
            and 'oo', where 'o' denotes 'observed' and 'i' denotes
            'imputed'.  Each inner dictionary holds keyword arguments
            passed to the lowess fit for that group of points.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ix1i = self.ix_miss[col1_name]
        ix1o = self.ix_obs[col1_name]
        ix2i = self.ix_miss[col2_name]
        ix2o = self.ix_obs[col2_name]

        ix_ii = np.intersect1d(ix1i, ix2i)
        ix_io = np.intersect1d(ix1i, ix2o)
        ix_oi = np.intersect1d(ix1o, ix2i)
        ix_oo = np.intersect1d(ix1o, ix2o)

        vec1 = np.require(self.data[col1_name], requirements="W")
        vec2 = np.require(self.data[col2_name], requirements="W")

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['oo', 'io', 'oi', 'ii']
        lak = {'i': 'imp', 'o': 'obs'}
        ixs = {'ii': ix_ii, 'io': ix_io, 'oi': ix_oi, 'oo': ix_oo}
        color = {'oo': 'grey', 'ii': 'red', 'io': 'orange',
                 'oi': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(vec1[ix], vec2[ix], 'o',
                        color=color[ky], label=lab, alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue
            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}
            ix = ixs[ky]
            lfit = lowess(vec2[ix], vec1[ix], **la)
            if plot_points:
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4)
            else:
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4, label=lab)

        ha, la = ax.get_legend_handles_labels()
        pad = 0.0001 if plot_points else 0.5
        leg = fig.legend(ha, la, loc='center right', numpoints=1,
                         handletextpad=pad)
        leg.draw_frame(False)

        ax.set_xlabel(col1_name)
        ax.set_ylabel(col2_name)

        return fig

    def plot_fit_obs(self, col_name, lowess_args=None,
                     lowess_min_n=40, jitter=None,
                     plot_points=True, ax=None):
        """
        Plot fitted versus imputed or observed values as a scatterplot.

        Parameters
        ----------
        col_name : str
            The variable to be plotted on the horizontal axis.
        lowess_args : dict-like
            Keyword arguments passed to lowess fit.  A dictionary of
            dictionaries, keys are 'o' and 'i' denoting 'observed' and
            'imputed', respectively.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ixi = self.ix_miss[col_name]
        ixo = self.ix_obs[col_name]

        vec1 = np.require(self.data[col_name], requirements="W")

        # Fitted values
        formula = self.conditional_formula[col_name]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")
        results = self.results[col_name]
        vec2 = results.predict(exog=exog)
        vec2 = self._get_predicted(vec2)

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['o', 'i']
        ixs = {'o': ixo, 'i': ixi}
        lak = {'o': 'obs', 'i': 'imp'}
        color = {'o': 'orange', 'i': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                ax.plot(vec1[ix], vec2[ix], 'o',
                        color=color[ky], label=lak[ky], alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue
            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}
            ix = ixs[ky]
            lfit = lowess(vec2[ix], vec1[ix], **la)
            ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                    alpha=0.6, lw=4, label=lak[ky])

        ha, la = ax.get_legend_handles_labels()
        leg = fig.legend(ha, la, loc='center right', numpoints=1)
        leg.draw_frame(False)

        ax.set_xlabel(col_name + " observed or imputed")
        ax.set_ylabel(col_name + " fitted")

        return fig

    def plot_imputed_hist(self, col_name, ax=None, imp_hist_args=None,
                          obs_hist_args=None, all_hist_args=None):
        """
        Display imputed values for one variable as a histogram.

        Parameters
        ----------
        col_name : str
            The name of the variable to be plotted.
        ax : AxesSubplot
            An axes on which to draw the histograms.  If not provided,
            one is created.
        imp_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for imputed values.
        obs_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for observed values.
        all_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for all values.

        Returns
        -------
        The matplotlib figure on which the histograms were drawn
        r   r   Nr   histtypestepImpObsAllr   rM   r  F	Frequency)r   r   r   r   r   r#   r6   r2   ri   r"   histr,   rA   rp   extendr   r   r   r   )r
   r  r   imp_hist_argsobs_hist_argsall_hist_argsr   r   r   r   r   r   rX   r  r  hh1h2r  s                      r   plot_imputed_histzMICEData.plot_imputed_histB  s'   6 	988888 M M M:**2..GC--//C
,,,---l8$k(#i!&s+i!&s+= 	( 	(B##!':RBs88a<<
399=99AIIaeAhIIeRWRZ__6666RWRZ	( 344FFFF
		2b6!9bfQi()))
		5%.!!!jjR^qjAAu
h
k"""
r   c                 .   |D ]}||         }t          |t          j                  s%|j        dk    r)|j        d         t          |          k    r||         ||<   |j        dk    r-|j        d         t          |          k    r||d d f         ||<   |S )NrM   r   r   )r   r,   ndarrayndimr   r"   )r
   r   rixrr   r%   s        r   
_boot_kwdszMICEData._boot_kwds  s     	$ 	$AQA a,,  !!'!*C"8"8C&Q !!'!*C"8"8CF)Qr   c                 :   |                      |          \  }}}}t          |          }t          j                            d||          }||         }||ddf         }|                     ||          }|                     ||          }| j        |         } |||fi || j        |<   || j        v r.| j        |         r! | j        |         j	        di || j
        |<   n  | j        |         j        di || j
        |<   | j
        |         j        | j        |<   dS )zD
        Perturbs the model's parameters using a bootstrap.
        r   Nr   )r   r"   r,   r   randintr0  r=   r8   r/   fit_regularizedr9   fitr>   )	r
   rI   r   r   r;   r<   mr/  klasss	            r   _perturb_bootstrapzMICEData._perturb_bootstrap  s<   
 ,0+@+@+G+G(tYJJi1a((c
CF|OOIs33	??8S11 '"U5$<<)<<ED$$$)9%)@$2E"2>>X>> L #9$+e"4"8"D"D8"D"DDL!\%07Er   c                 l   |                      |          \  }}}}| j        |         } |||fi || j        |<    | j        |         j        di || j        |<   | j        |                                         }| j        |         j        }t          j        	                    ||          | j        |<   dS )z
        Gaussian perturbation of model parameters.

        The normal approximation to the sampling distribution of the
        parameter estimates is used to define the mean and covariance
        structure of the perturbation distribution.
        )rR   r   Nr   )
r   r=   r8   r4  r9   
cov_paramsr>   r,   r   multivariate_normal)	r
   rI   r   r   r;   r<   r6  r   mus	            r   _perturb_gaussianzMICEData._perturb_gaussian  s     ,0+@+@+G+G(tY '"U5$<<)<<E4dk%04@@x@@Ul5!,,..\% 'Y:::LLEr   c                     | j         |         dk    r|                     |           d S | j         |         dk    r|                     |           d S t          d          )Nr   bootzunknown perturbation method)r   r<  r7  r-   r
   rI   s     r   perturb_paramszMICEData.perturb_params  sk    #E*j88""5)))))%e,66##E*****:;;;r   c                 0    |                      |           d S r   )
impute_pmmr?  s     r   imputezMICEData.impute  s     	r   c                 Z    |                      |           |                     |           dS )a.  
        Impute missing values for a single variable.

        This is a two-step process in which first the parameters are
        perturbed, then the missing values are re-imputed.

        Parameters
        ----------
        vname : str
            The name of the variable to be updated.
        N)r@  rC  r?  s     r   ro   zMICEData.update  s0     	E"""Er   c                     t          |t          j                  r|S t          |t          j                  r|j        S t          |d          r|j        S t          d|j	        z            )Npredicted_valuesz&cannot obtain predicted values from %s)
r   r,   r-  r[   SeriesvalueshasattrrF  r-   	__class__)r
   objs     r   r  zMICEData._get_predicted  su    c2:&& 	JJRY'' 	J:S,-- 	J''83=HJ J Jr   c                    | j         }|                     |          \  }}}}}| j        |         } |j        | j        |         |fi |}	 |j        | j        |         |fi |}
|                     |	          }	|                     |
          }
t          j        |	          }||         }|	|         }	t          j        |	|
          }|dddf         t          j	        | |          dddf         z   }t          j
        |dk     |t          |          dz
  k    z            }t          j        |dt          |          dz
            }|
dddf         |	|         z
  }t          j        |          }t          j        ||<   t          j        |d          ddd|f         }t          j                            d|t          |
                    }t          j	        |j        d                   }|||f         }|||f         }t          j        ||                                                   }|                     ||           dS )z
        Use predictive mean matching to impute missing values.

        Notes
        -----
        The `perturb_params` method must be called first to define the
        model.
        Nr   rM   )rF   r   r8   r  r>   r  r,   rB   searchsortedr   nonzeror"   cliprS   infr   r2  r   arraysqueezerl   )r
   rI   rF   r   r   r   r   r   model
pendog_obspendog_missrK   rY   r   mskdxdxiirjjizimputed_misss                        r   rB  zMICEData.impute_pmm  sU    
 && 	L	8Y(8:K
 E""U]4;u#5x 7 7%57 7
#emDK$6	 9 9&79 9 ((44
))+66 Z
##bM	^
 _Z55 DkBIufe44T111W== j#'cC	NNQ,>&>?@@gc1c)nnq011 D!JsO3VBZZ&3 jQ1U7
+ Yq%[)9)9:: Ysy|$$"b]"b]x	"..6688E<00000r   )r   r   N)NNNNNr   NF)rM   )Nr   r   FFT)Nr   NTN)NNNN)r   r   r   formatr   r   r   rO   rE   r7   r?   rl   rN   r   r~   r   r   r  r  r+  r0  r7  r<  r@  rC  ro   r  rB  r   r   r   r   r      s       &L 	$899M P 2<,0@ @ @ @D  ,3 3 3"   AE@DDIK3 K3 K3 K3ZU U U $ $ $ $.3# 3# 3#j  +0 +0 +0Z 7@*3053804	e e e eP 799=m m m m^ 26-1*.^ ^ ^ ^@ BF<@E E E ER  &8 8 84M M M&< < <  
  "
J 
J 
J=1 =1 =1 =1 =1r   r   z
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = mice.fit(10, 10)
    >>> print(results.summary())

    .. literalinclude:: ../plots/mice_example_1.txt
    z
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = []
    >>> for k in range(10):
    ...     x = mice.next_sample()
    ...     results.append(x)
    c                   \    e Zd Zd                    ee          Z	 	 d
dZd ZddZ	d	 Z
dS )MICEa      Multiple Imputation with Chained Equations.

    This class can be used to fit most statsmodels models to data sets
    with missing values using the 'multiple imputation with chained
    equations' (MICE) approach.

    Parameters
    ----------
    model_formula : str
        The model formula to be fit to the imputed data sets.  This
        formula is for the 'analysis model'.
    model_class : statsmodels model
        The model to be fit to the imputed data sets.  This model
        class is for the 'analysis model'.
    data : MICEData instance
        MICEData object containing the data set for which
        missing values will be imputed
    n_skip : int
        The number of imputed datasets to skip between consecutive
        imputed datasets that are used for analysis.
    init_kwds : dict-like
        Dictionary of keyword arguments passed to the init method
        of the analysis model.
    fit_kwds : dict-like
        Dictionary of keyword arguments passed to the fit method
        of the analysis model.

    Examples
    --------
    Run all MICE steps and obtain results:
    {mice_example_1}

    Obtain a sequence of fitted analysis models without combining
    them into a summary::
    {mice_example_2}
    )mice_example_1mice_example_2   Nc                 x    || _         || _        || _        || _        g | _        ||ni | _        ||ni | _        d S r   )model_formular=   n_skipr2   results_listr;   r<   )r
   rd  r=   r2   re  r;   r<   s          r   r   zMICE.__init__t  sN     +&	&/&;$,$8br   c                 N   | j                             | j        dz              d}t          | j                  dk    r| j        d         j        } | j        j        | j        | j         j         fi | j	        }| j
                            d|i            |j        di | j
        }|S )a`  
        Perform one complete MICE iteration.

        A single MICE iteration updates all missing values using their
        respective imputation models, then fits the analysis model to
        the imputed data.

        Returns
        -------
        result : results instance
            The fitted analysis model results for the current imputed
            data set.

        Notes
        -----
        This function fits the analysis model and returns the fitted
        results.  The results are not stored by the class and are not
        used in any subsequent calls to `combine`.
        Use `fit` to run all MICE steps together and obtain summary
        results.

        The missing value imputation cycle is repeated `n_skip + 1`
        times, and the analysis model is then fit to the final
        imputed data set.
        rM   Nr   r  start_paramsr   )r2   rN   re  r"   rf  r>   r=   from_formulard  r;   r<   ro   r4  )r
   rh  rS  results       r   rO   zMICE.next_sample  s    6 		T[1_---t !!A%%,R07L . -d.@.2in@ @04@ @ 	nl;<<<++T]++r   
   c                 "   | j                             |           t          |          D ]0}|                                 }| j                            |           1|j        j        | _        |j        j        | _        | 	                                S )z
        Fit a model using MICE.

        Parameters
        ----------
        n_burnin : int
            The number of burn-in cycles to skip.
        n_imputations : int
            The number of data sets to impute
        )
r2   rN   rn   rO   rf  rp   rS  endog_names
exog_namescombine)r
   n_burninn_imputationsr   rj  s        r   r4  zMICE.fit  s     		X&&&}%% 	- 	-A%%''F$$V,,,,!<3 ,1||~~r   c                    g }d}g }| j         D ]T}|j        }|                    |j                   ||                                z  }|                    |j                   Ut          j        |          }t          j        |          }|                    d          }|t          | j                   z  }t          j
        |j                  }ddt          t          | j                             z  z   }|||z  z   }	|t          j        |          z  t          j        |	          z  }
t          j        |          }t          | ||	|z            }||_        |
|_        | j        |_        | j        |_        | j        |_        |S )a   
        Pools MICE imputation results.

        This method can only be used after the `fit` method has been
        called.  Returns estimates and standard errors of the analysis
        model parameters.

        Returns a MICEResults instance.
        g        r   rM   )rf  _resultsrp   r>   r9  scaler,   rA   rR   r"   r   r   floatdiagMICEResultsfrac_miss_inforn  rm  r=   )r
   params_list
cov_within
scale_listr9   
results_uwr>   cov_betweenfr9  fmirt  s               r   ro  zMICE.combine  s    

( 	- 	-G )Jz0111*//111Jgm,,,,j--Z
++
 !!!$$ 	c$+,,,
 f[]++ E#d/001111!k/1
 "'+&&&)<)<< 
##dFJ,>??!$!_".".r   )rb  NN)rk  rk  )r   r   r   r]  _mice_example_1_mice_example_2r   r   rO   r4  ro  r   r   r   r_  r_  K  s        $H 	o) 	 	+ 	+I N AB*.
A 
A 
A 
A' ' 'R   01 1 1 1 1r   r_  c                   &     e Zd Z fdZddZ xZS )rw  c                 N    t                                          |||           d S r   )superr   )r
   rS  r>   normalized_cov_paramsrJ  s       r   r   zMICEResults.__init__  s5    *?	A 	A 	A 	A 	Ar   N皙?c                    ddl m} |                                }d}i }d|d<   | j        j        |d<   | j        |d<   d| j        j        j        j        d         z  |d	<   d
| j	        z  |d<   dt          | j        j                  z  |d<   |                    |d|           |                    | |          }| j        |d<   |                    ||           |                    ||            |S )a  
        Summarize the results of running MICE.

        Parameters
        ----------
        title : str, optional
            Title for the top table. If not None, this replaces
            the default title.
        alpha : float
            Significance level for the confidence intervals

        Returns
        -------
        smry : Summary instance
            This holds the summary tables and text, which can be
            printed or converted to various output formats.
        r   )summary2z%8.3fr_  zMethod:zModel:zDependent variable:z%dzSample size:z%.2fScalezNum. imputationsl)alignfloat_format)r   FMI)r  )titler9   )statsmodels.iolibr  Summaryr=   r   rm  rS  r2   r   rt  r"   rf  add_dictsummary_paramsrx  add_df	add_title)r
   r  r   r  smryr  infoparams           r   summaryzMICEResults.summary  s   & 	/.....!! Y)2X&*&6"##djo&:&@&CC^+W#'#dj.E*F*F#F d#LAAA''E'::*eE555UD111r   )Nr  )r   r   r   r   r  __classcell__)rJ  s   @r   rw  rw    sQ        A A A A A
( ( ( ( ( ( ( (r   rw  )r   pandasr[   numpyr,   r{   statsmodels.base.modelr   #statsmodels.regression.linear_modelr   collectionsr   r   r   r   r  r  r_  rw  r   r   r   <module>r     s@  s sj          9 9 9 9 9 9 3 3 3 3 3 3 # # # # # #3 ( ( ( ( ( ( ( (e1 e1 e1 e1 e1 e1 e1 e1Pg g g g g g g gT/ / / / /( / / / / /r   