
"""
Created on Sat May 19 15:53:21 2018

Author: Josef Perktold
License: BSD-3
"""

from collections import defaultdict
import numpy as np

from statsmodels.base._penalties import SCADSmoothed


class ScreeningResults:
    """Results for Variable Screening

    Note: Indices, except for idx_exog and, in the iterated case,
    idx_nonzero_batches, are based on the combined [exog_keep, exog] array.

    Attributes
    ----------
    results_final : instance
        Results instance returned by the final fit of the penalized model, i.e.
        after trimming exog with params below trimming threshold.
    results_pen : results instance
        Results instance of the penalized model before trimming. This includes
        variables from the last forward selection
    idx_nonzero
        index of exog columns in the final selection including exog_keep
    idx_exog
        index of exog columns in the final selection for exog candidates, i.e.
        without exog_keep
    idx_excl
        idx of excluded exog based on combined [exog_keep, exog] array. This is
        the complement of idx_nonzero
    converged : bool
        True if the iteration has converged and stopped before maxiter has been
        reached. False if maxiter has been reached.
    iterations : int
        number of iterations in the screening process. Each iteration consists
        of a forward selection step and a trimming step.
    history : dict of lists
        results collected for each iteration during the screening process.
        Keys are 'idx_nonzero', 'keep', 'params_keep' and 'idx_added'.

    The ScreeningResults returned by `screen_exog_iterator` has additional
    attributes:

    idx_nonzero_batches : ndarray 2-D
        Two-dimensional array with batch index in the first column and variable
        index within batch in the second column. They can be used jointly as
        index for the data in the exog_iterator.
    exog_final_names : list[str]
        'var<bidx>_<idx>' where `bidx` is the batch index and `idx` is the
        index of the selected column within batch `bidx`.
    history_batches : dict of lists
        This provides information about the selected variables within each
        batch during the first-round screening.
        'idx_nonzero' is based on the array that includes exog_keep, while
        'idx_exog' is the index based on the exog of the batch.
    """

    def __init__(self, screener, **kwds):
        self.screener = screener
        self.__dict__.update(**kwds)


class VariableScreening:
    """Ultra-high, conditional sure independence screening

    This is an adjusted version of Fan's sure independence screening.

    Parameters
    ----------
    model : instance of penalizing model
        examples: GLMPenalized, PoissonPenalized and LogitPenalized.
        The attributes of the model instance `pen_weight` and `penal` will be
        ignored.
    pen_weight : None or float
        penalization weight used in SCAD penalized MLE
    k_add : int
        number of exog to add during expansion or forward selection;
        see the Notes section for tie handling
    k_max_add : int
        maximum number of variables to include during variable addition, i.e.
        forward selection. default is 30
    threshold_trim : float
        threshold for trimming parameters to zero, default is 1e-4
    k_max_included : int
        maximum total number of variables to include in model.
    ranking_attr : str
        This determines the result attribute or model method that is used for
        the ranking of exog to include. The availability of attributes depends
        on the model.
        Default is 'resid_pearson'; 'model.score_factor' can be used in GLM.
    ranking_project : bool
        If ranking_project is True, then the exog candidates for inclusion are
        first projected on the already included exog before the computation
        of the ranking measure. This brings the ranking measure closer to
        the statistic of a score test for variable addition.

    Notes
    -----
    Status: experimental, tested only on a limited set of models and
    with a limited set of model options.

    Tie handling: If there are ties at the decision threshold, then all those
    tied exog columns are treated in the same way. During forward selection
    all exog columns with the same boundary value are included. During
    elimination, the tied columns are not dropped. Consequently, if ties are
    present, then the number of included exog can be larger than specified
    by k_add, k_max_add and k_max_included.
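
    A minimal sketch of the tie rule for forward selection, assuming that
    candidates are ranked by score and that every candidate tied with the
    k-th largest score is kept. The helper `select_top_k_with_ties` is
    hypothetical and not part of statsmodels; the cutoff used by the class
    itself may differ in detail.

```python
def select_top_k_with_ties(scores, k):
    """Return indices whose score is at least the k-th largest score.

    Hypothetical helper for illustration only. Ties at the boundary are all
    included, so more than k indices can be returned.
    """
    if k <= 0 or not scores:
        return []
    k = min(k, len(scores))
    # Score at the selection boundary (k-th largest value).
    boundary = sorted(scores, reverse=True)[k - 1]
    return [i for i, s in enumerate(scores) if s >= boundary]


# Two candidates share the boundary value 0.5, so asking for k=2
# yields three selected columns.
scores = [0.9, 0.5, 0.5, 0.1]
selected = select_top_k_with_ties(scores, k=2)
```

    In this example three indices are returned for k=2, which mirrors the
    statement that the number of included exog can exceed the requested
    count when ties are present.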

    The screening algorithm works similarly to stepwise regression. Each
    iteration of the screening algorithm includes a forward selection step
    where variables are added to the model, and a backwards selection step
    where variables are removed. In contrast to stepwise regression, we add
    a fixed number of variables at each forward selection step. The
    backwards selection step is based on SCAD penalized estimation and
    trimming of variables with estimated coefficients below a threshold.
    The tuning parameters can be used to adjust the number of variables to add
    and to include depending on the size of the dataset.
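
    The trimming part of the backwards step can be sketched as follows,
    assuming that a coefficient is kept when its absolute value exceeds
    `threshold_trim` and that the first `k_keep` columns (the always-kept
    exog) are never dropped. The function `trim_params` is a hypothetical
    illustration, not the statsmodels implementation.

```python
def trim_params(params, threshold_trim=1e-4, k_keep=1):
    """Return a list of booleans marking which coefficients are kept.

    Hypothetical helper: coefficients with absolute value above the
    threshold are kept, and the first k_keep entries are always kept.
    """
    keep = [abs(p) > threshold_trim for p in params]
    for i in range(min(k_keep, len(keep))):
        keep[i] = True
    return keep


# The first coefficient belongs to an always-kept column; the third is
# below the threshold and gets trimmed.
kept = trim_params([0.0, 0.5, 1e-6, -0.2], threshold_trim=1e-4, k_keep=1)
```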

    There is currently no automatic tuning parameter selection. Candidate
    explanatory variables should be standardized or should be on a similar
    scale because penalization and trimming are based on the absolute values
    of the parameters.
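
    Because trimming compares absolute coefficient values, candidates are
    usually put on a common scale first. A minimal sketch using only the
    standard library, assuming the data are given as a list of rows; the
    helper `standardize_columns` is hypothetical, and in practice a numpy
    array would be standardized in the same way.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Center each column at zero and scale it to unit standard deviation.

    Hypothetical helper. Constant columns are mapped to zeros to avoid
    division by zero.
    """
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) for c in cols]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two columns on very different scales end up comparable.
standardized = standardize_columns([[1.0, 10.0], [3.0, 30.0]])
```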


    TODOs and current limitations:

    freq_weights are not supported in this. Candidate ranking uses
    moment condition with resid_pearson or others without freq_weights.
    pearson_resid: GLM resid_pearson does not include freq_weights.

    variable names: do we keep track of those? currently made-up names

    currently only supports numpy arrays, no exog type check or conversion

    currently only single columns are selected, no terms (multi column exog)
    NT   -C6?   resid_pearsonc
                     || _         |j        | _        |                                | _        | j                            dd            | j                            dd            |j        | _        |j        | _        |j        j	        d         | _
        t          | j                  | _        |                                 | _        ||| _        n| j        dz  | _        || _        || _        || _        || _        || _        || _        |	| _        d S )N
pen_weightpenal   
   )model	__class__model_class_get_init_kwds	init_kwdspopendogexog	exog_keepshapek_keeplennobs
_get_penalr   r   use_weightsk_add	k_max_addthreshold_trimk_max_includedranking_attrranking_project)
r   r!   r   r/   r0   r1   r2   r3   r4   r5   s
             r   r   zVariableScreening.__init__   s     
 ?--// 	<...7D)))[
j&q)
OO	__&&
!(DOO"i"nDO '
",,(.r   c                 &    t          dd|          S )z$create new Penalty instance
        g?r   )c0weightsr   )r   r8   s     r   r.   zVariableScreening._get_penal   s     CFG<<<<r   c                    | j         }| j        r|j        j        j        d         t          |          k    sJ |j        j        dd|f         }||                    t          j        	                    |                              |                    z
  }| j
        dk    rV|j                                        }|d|| <   |j                            |          }||z
  t          j        |          z  }n| j
        dd         dk    r| j
                            d          d         }	 t!          |j        |	          |j                  }|j        dk    r|dddf         }t          j        |                    |                    dz  }
n?t!          || j
                  }t          j        |                    |                    dz  }
|
S )	zBcompute measure for ranking exog candidates for inclusion
        r   Npredicted_poissonr      zmodel..   )r'   r5   r!   r(   r*   r,   dotnplinalgpinvr4   paramscopypredictsqrtsplitgetattrndimabs)r   res_penr(   keepr'   ex_inclp	predictedresid_factorattrmom_conds              r   ranking_measurez!VariableScreening.ranking_measure   s    
 	I=%+A.#d));;;;m(D1G'++binnW&=&=&A&A$&G&GHHHD 333
 ##%%A4%--a00I!I-1C1CCLLrr"h..$**3//2D777=$77GGL A%%+AAAqD1vl..t4455q8HH #7D,=>>Lvl..t4455q8Hr   d   bfgsFc                 
   | j         }|| j        }| j        }| j        }	|}
|j        d         }t          j        ||
f          }|j        \  }}||ni }ddd}|                    |           t          t                    }t          j
        |	t                    }t          j        |	t
          j                  }t          j
        |	|          } |||fi | j        }d|_         |j        di |}|j        }d}g }t%          |          D ]9}|dd|f         }
|                     ||
|          }t)          |          t)          |          k    sJ t          j        |          ddd	         }t-          | j        || j        z   t)          |          f          }||         }t          j        ||||k             f          }t          j        t)          |                    }||dt)          |          <   | j        r4t          j        t)          |                    } d| d|	<   | | j        _         |||dd|f         f| j        | j        d
| j        } |j        d||ddd|}t          j        |j                  | j        k    }|                                 | j!        k    rit          j        t          j        |j                            | j!                  }!t          j        |j                  |!k    }"t          j"        ||"          }d|d|	<   ||         }|rtG          |           tG          |           t)          |          }|j        |         }t          j        |tH                    }#d|#|<   t          j%        |#          d         }|d         &                    |           |d         &                    |           |d         &                    |           |d         &                    |           t)          |          t)          |          k    r||k    '                                rd} n|};t          j'        |d|	         t          j
        |	          k              sJ | j        r?t          j        t)          |                    } d| d|	<   | (                    |           }$n| j        }$ |||dd|f         f|$| j        d
| j        }% |%j        d||dd|}&d |D             }'|'|	d         |&j)        j*        |	d<   tW          | ||&|||	d         |	z
  ||||dz   	  	        }(|(S )a  screen and select variables (columns) in exog

        Parameters
        ----------
        exog : ndarray
            candidate explanatory variables that are screened for inclusion in
            the model
        endog : ndarray (optional)
            use a new endog in the screening model.
            This is not tested yet, and might not work correctly
        maxiter : int
            number of screening iterations
        method : str
            optimization method to use in fit, needs to be one of the gradient
            optimizers
        disp : bool
            display option for fit during optimization

        Returns
        -------
        res_screen : instance of ScreeningResults
            The attribute `results_final` contains the results instance
            with the final model selection.
            `idx_nonzero` contains the index of the selected exog in the full
            exog, i.e. the combined exog that are always kept plus
            exog_candidates.
            See ScreeningResults for a full description.
        Nr      F)maxiterdisp)dtyper   )rK   )r   r   T)methodstart_paramswarn_convergenceskip_hessianidx_nonzerorK   params_keep	idx_added)r8   )r[   r\   r]   c                     g | ]}d |z  S )zvar%4dr	   ).0iis     r   
<listcomp>z1VariableScreening.screen_exog.<locals>.<listcomp>d  s    666B(R-666r   )results_penresults_finalr_   idx_exogidx_exclhistory	converged
iterationsr	   ),r#   r'   r)   r+   r*   r?   column_stackr   r   listarangeintonesbool_r%   r   fitrB   rangerR   r,   sortminr1   r0   concatenatezerosr/   r   r8   rI   r2   sumr3   logical_andprintboolnonzeroappendallr.   r!   
exog_namesr   ))r   r(   r'   rW   r[   rX   fit_kwdsr#   x0r+   x1	k_currentxr-   k_varsfkwdsrj   r_   rK   ri   mod_penrJ   r\   rk   idx_olditrQ   mcsidx_thr	thresholdidxstart_params2r8   thresh_paramskeep2	mask_exclr   	mod_final	res_finalxnamesress)                                            r   screen_exogzVariableScreening.screen_exog   s   : &=JE^HQK	 ORH%%wf$0b"E22d##ic222wvrx((9VV,,+eR::4>::'+))))~	.. ?	" ?	"B111h;B++GRd+CCHx==CMM1111'(##DDbD)C4>9tz+A3s88LMMGGI.+x98L/M!NOOCHSXX..M0<M,3|,,,- -'#c((++#$  &-
"!k%111c6 4$*-1_4 4$(N4 4G "gk ./<38t. . %-. .G
 6'.))D,??DxxzzD/// "w~(>(> ? ?9=9L8L!Nw~..>~dE22 !D&Md)K #dk""" K((I">$/L d333I%*Ik"z),,Q/HM"))+666FO""4(((M")),777K '',,,K  CLL00 G+0022 1 	!GG vk'6'*bi.?.??@@@@@ 	gc+..//G GGVGOOGO44EEJEKqK'8 2&++/?2 2 #'.2 2	
 "IM ./;38. . %-. .	
 76+666.4VWWo	"677+t-4/8-8*5fgg*>*G*2)0+4,.F	! 	! 	! 
r   c                    | j         }g }g }g }|D ]}|                     |d          }|                    |j                   |                    |dd|j        |d         |z
  f                    |                    |j        |d         |z
             t	          j        |          }|                     |d          }d t          |          D             }	d t          |          D             }
|j        |d         |z
  }t	          j        |	          |         }t	          j        |
          |         |_        ||_	        ||d}||_
        |S )ay  
        batched version of screen exog

        This screens variables in a two step process:

        In the first step screen_exog is used on each element of the
        exog_iterator, and the batch winners are collected.

        In the second step all batch winners are combined into a new array
        of exog candidates and `screen_exog` is used to select a final
        model.

        Parameters
        ----------
        exog_iterator : iterator over ndarrays

        Returns
        -------
        res_screen_final : instance of ScreeningResults
            This is the instance returned by the second round call to
            `screen_exog`. Additional attributes are added to provide
            more information about the batched selection process.
            The index of final nonzero variables is
            `idx_nonzero_batches` which is a 2-dimensional array with batch
            index in the first column and variable index within batch in the
            second column. They can be used jointly as index for the data
            in the exog_iterator.
            see ScreeningResults for a full description
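
        A sketch of how (batch index, column index) pairs in the layout
        described for `idx_nonzero_batches` could be used to collect the
        selected columns. The data and the helper `gather_selected_columns`
        are hypothetical and use plain lists instead of ndarrays.

```python
# Hypothetical data: two batches, each a list of rows with two columns.
batches = [
    [[1.0, 2.0], [3.0, 4.0]],  # batch 0
    [[5.0, 6.0], [7.0, 8.0]],  # batch 1
]

# Pairs of (batch index, column index within that batch).
idx_nonzero_batches = [(0, 1), (1, 0)]


def gather_selected_columns(batches, idx_pairs):
    """Collect the selected columns, row by row, into one list of rows."""
    n_rows = len(batches[0])
    return [[batches[b][r][c] for b, c in idx_pairs] for r in range(n_rows)]


selected = gather_selected_columns(batches, idx_nonzero_batches)
```

        If the exog_iterator is a one-shot generator, the batches need to
        be stored or regenerated before they can be indexed this way.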
        r   )rW   Nc                 ,    g | ]\  }}|D ]	}d ||fz  
S )zvar%d_%dr	   rc   bidxbatchr   s       r   re   z:VariableScreening.screen_exog_iterator.<locals>.<listcomp>  sJ     / / /!,u(-/ /!$ (4+5 / / / /r   c                 &    g | ]\  }}|D ]}||fS r	   r	   r   s       r   re   z:VariableScreening.screen_exog_iterator.<locals>.<listcomp>  sE     & & &#e$& & 3K & & & &r   )r_   rh   )r+   r   r~   r_   r?   rm   	enumeratearrayidx_nonzero_batchesexog_final_nameshistory_batches)r   exog_iteratorr+   res_idxexog_winnerexog_idxex
res_screenres_screen_finalexog_winner_namesidx_fullex_final_idxfinal_namesrj   s                 r   screen_exog_iteratorz&VariableScreening.screen_exog_iterators  s   <  	F 	FB))"b)99J NN:1222r!!!Z%;FGG%Dv%M"MNOOOOOJ2677;fDEEEEok22++K+DD/ /09(0C0C/ / /& &'0':':& & & (3FGG<vEh011,?/1x/A/A,/O,,7)")') )+2(r   )NTr   r   r   r   r   T)N)NrS   rT   FN)	r   r   r   r   r   r.   rR   r   r   r	   r   r   r   r   D   s        H HT HJCE?C/ / / /@= = = =
! ! ! !F AG)-Z Z Z Zx<  <  <  <  < r   r   )	r   collectionsr   numpyr?   statsmodels.base._penaltiesr   r   r   r	   r   r   <module>r      s     $ # # # # #     4 4 4 4 4 43% 3% 3% 3% 3% 3% 3% 3%lk  k  k  k  k  k  k  k  k  k r   