directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
Directory where the data (XML files) are stored
A (sample) file in that directory
xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')
The XML file is read into a hierarchical pd.DataFrame, i.e., one whose columns form a pd.MultiIndex. CAVEAT: everything below assumes this hierarchical layout.
| | id | summary | title | updated | ContractFolderStatus | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | ContractFolderID | ContractFolderStatusCode | LocatedContractingParty | | | ProcurementProject | ... | TechnicalDocumentReference | | ProcurementProject | | LocatedContractingParty | | TenderingProcess | | TenderResult | TenderingProcess |
| | | | | | | | BuyerProfileURIID | Party | ParentLocatedParty | Name | ... | ID | Attachment | PlannedPeriod | | Party | ParentLocatedParty | ParticipationRequestReceptionPeriod | | AwardedTenderedProject | TenderSubmissionDeadlinePeriod |
| | | | | | | | | PartyName | PartyName | | ... | | ExternalReference | StartDate | EndDate | PartyIdentification | ParentLocatedParty | EndDate | EndTime | ProcurementProjectLotID | |
| | | | | | | | | Name | Name | | ... | | URI | | | ID | PartyName | | | | |
| | | | | | | | | | | | ... | | | | | | Name | | | | |
| 0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | 2022-01-03 00:11:41.826000+00:00 | C. 2-2021 | ADJ | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-12-17 14:00:00+00:00 |
| 1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | 2022-01-03 00:00:11.194000+00:00 | 8128_3/2021 | PUB | <NA> | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-22 23:30:00+00:00 |
| 2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | 2022-01-03 00:00:10.399000+00:00 | 1000_0005-CP01-2021-000063 | EV | <NA> | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
| 3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | 2022-01-02 23:11:40.740000+00:00 | 1379/2020 4738 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
| 4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | 2022-01-02 23:11:40.696000+00:00 | 2021-44 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
5 rows × 39 columns
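For readers unfamiliar with column-multiindex frames, here is a minimal sketch of how such a pd.DataFrame is indexed. The toy frame is made up for illustration (only two levels, and the `id` values are placeholders); the real frame above is much deeper.

```python
import pandas as pd

# Toy frame with a two-level column index; labels mimic the real ones,
# the 'id' values are placeholders
columns = pd.MultiIndex.from_tuples([
    ('id', ''),
    ('ContractFolderStatus', 'ContractFolderID'),
    ('ContractFolderStatus', 'ContractFolderStatusCode'),
])
toy = pd.DataFrame(
    [['a', 'C. 2-2021', 'ADJ'], ['b', '2021-44', 'EV']],
    columns=columns,
)

# A full tuple selects a single column (a pd.Series)
status = toy[('ContractFolderStatus', 'ContractFolderStatusCode')]

# A top-level label selects every column underneath it (a pd.DataFrame)
folder = toy['ContractFolderStatus']
```

Note that a full key must have one entry per level, which is why shorter paths need padding with empty strings.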
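The columns used to discriminate regions
Each path is turned into a full column key by padding it with empty strings up to the depth of the column index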
domain_discriminative_columns = [sproc.hier.pad_col_levels(df, p) for p in domain_discriminative_columns_paths]
domain_discriminative_columns
[('ContractFolderStatus',
'LocatedContractingParty',
'BuyerProfileURIID',
'',
'',
''),
('ContractFolderStatus',
'LegalDocumentReference',
'Attachment',
'ExternalReference',
'URI',
'')]
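The padding step can be sketched in plain Python. `pad_path` is a hypothetical helper written for illustration, not the actual `sproc.hier.pad_col_levels` (which takes the pd.DataFrame itself, presumably to find out how many levels the columns have); the two paths are inferred from the output above, and the depth of 6 is read off the padded tuples.

```python
def pad_path(path, n_levels, fill=''):
    """Pad a partial column path with `fill` so it has exactly `n_levels` entries."""
    if len(path) > n_levels:
        raise ValueError(f'path {path!r} is deeper than {n_levels} levels')
    return tuple(path) + (fill,) * (n_levels - len(path))


# Paths inferred from the output above (an assumption about the original input)
paths = [
    ['ContractFolderStatus', 'LocatedContractingParty', 'BuyerProfileURIID'],
    ['ContractFolderStatus', 'LegalDocumentReference', 'Attachment', 'ExternalReference', 'URI'],
]

padded = [pad_path(p, n_levels=6) for p in paths]
```

With a real frame, `n_levels` could be taken from `df.columns.nlevels`.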
Unique values of the first one
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
<NA>,
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009']
pd.NA values are filtered out
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=15937468']
A one-liner returning a pd.DataFrame of columns with parsed domains
domains = df[domain_discriminative_columns].applymap(lambda x: urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA)
domains
| | ContractFolderStatus | |
|---|---|---|
| | LocatedContractingParty | LegalDocumentReference |
| | BuyerProfileURIID | Attachment |
| | | ExternalReference |
| | | URI |
| 0 | contractaciopublica.gencat.cat | <NA> |
| 1 | <NA> | <NA> |
| 2 | <NA> | <NA> |
| 3 | contractaciopublica.gencat.cat | contractaciopublica.gencat.cat |
| 4 | contractaciopublica.gencat.cat | contractaciopublica.gencat.cat |
| ... | ... | ... |
| 112 | NaN | NaN |
| 113 | NaN | NaN |
| 114 | NaN | NaN |
| 115 | NaN | NaN |
| 116 | NaN | NaN |
117 rows × 2 columns
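The same idea on a toy frame, as a rough sketch. The column names and the `example.org` URL are made up; only the gencat URL comes from the data above. It uses `Series.map` column by column rather than `applymap`, since `DataFrame.applymap` is deprecated in recent pandas releases in favour of `DataFrame.map`.

```python
import urllib.parse

import pandas as pd


def netloc_or_na(x):
    """Return the network location (domain) of a URL, or pd.NA for missing values."""
    return urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA


# Toy data: flat, made-up column names instead of the real hierarchical ones
urls = pd.DataFrame({
    'buyer_profile': [
        'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
        pd.NA,
        pd.NA,
    ],
    'legal_document': [pd.NA, pd.NA, 'https://example.org/doc.pdf'],
})

# Parse every cell, one column at a time
parsed = urls.apply(lambda col: col.map(netloc_or_na))

# Number of non-null domains per column
counts = parsed.notna().sum()
```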
How many non-nulls are in every column?
ContractFolderStatus LocatedContractingParty BuyerProfileURIID 42
LegalDocumentReference Attachment ExternalReference URI 70
dtype: int64
What about the combination?
domains[domain_discriminative_columns[0]].combine_first(domains[domain_discriminative_columns[1]]).notna().sum()
77
A function returning a pd.Series with domains
domain (df:pandas.core.frame.DataFrame)
Extract the (internet) domains from the given data
| | Type | Details |
|---|---|---|
| df | DataFrame | Input |
| Returns | Series | Domains |
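A function along these lines could combine the steps shown earlier: parse each discriminative column and keep the first non-null domain per row. This is only a sketch under assumptions, not the library's implementation: `domain_sketch` and its `columns` parameter are hypothetical (the documented `domain` takes only `df`), and the toy data uses flat, made-up column names.

```python
import urllib.parse

import pandas as pd


def domain_sketch(df, columns):
    """Return, row by row, the first non-null domain found across `columns`."""

    def netloc_or_na(x):
        return urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA

    result = df[columns[0]].map(netloc_or_na)
    for col in columns[1:]:
        # Fill the gaps left so far with the domains parsed from the next column
        result = result.combine_first(df[col].map(netloc_or_na))
    return result


# Toy data (made up, except for the gencat URL)
toy = pd.DataFrame({
    'buyer_profile': [
        'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
        pd.NA,
        pd.NA,
    ],
    'legal_document': [pd.NA, pd.NA, 'https://example.org/doc.pdf'],
})

toy_domains = domain_sketch(toy, ['buyer_profile', 'legal_document'])
```

With the hierarchical frame, `columns` would be the padded tuples in `domain_discriminative_columns`.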
The function adds a new column to the pd.DataFrame in place
A function to parse a string containing either a year or a year and a month
year_and_maybe_month (s:str)
| | Type | Details |
|---|---|---|
| s | str | Raw date |
| Returns | datetime | Parsed date |
Only a year
A year and a month
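One possible way to write such a parser, as a sketch only. The accepted formats (`YYYY` and `YYYY-MM`) are an assumption, since the raw strings are not shown here, and `parse_year_and_maybe_month` is a hypothetical stand-in for the documented function.

```python
from datetime import datetime


def parse_year_and_maybe_month(s):
    """Parse 'YYYY' or 'YYYY-MM' (assumed formats) into a datetime.

    Missing components default to the first month / first day.
    """
    s = s.strip()
    # Try the more specific format first
    for fmt in ('%Y-%m', '%Y'):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f'cannot parse {s!r} as a year or a year and a month')


only_year = parse_year_and_maybe_month('2022')
year_and_month = parse_year_and_maybe_month('2021-12')
```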