Gabriel Salvador

Question

student•

I think knowing how to extract text that contains bold, italics, or other colors is a basic skill, but I see that everyone else has the same problem and no one has solved it.

Sebastian Calderón Araque

student•

I found the following solution using XPath axes:

//div[@class="html-content"]/p/descendant-or-self::text()
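A minimal sketch of how that expression behaves with lxml, assuming an invented HTML fragment (the markup in `SAMPLE` is illustrative and not taken from a real page). Each text node comes back as a separate list item, so the pieces still have to be joined:

```python
import lxml.html as html

# Hypothetical fragment: one paragraph with a bold and an italic word
SAMPLE = (
    '<div class="html-content">'
    '<p>The <strong>mayor</strong> spoke <em>briefly</em> today.</p>'
    '</div>'
)

parsed = html.fromstring(SAMPLE)

# Collect every text node under each <p>, including text inside nested tags
fragments = parsed.xpath(
    '//div[@class="html-content"]/p/descendant-or-self::text()'
)

# Join the fragments to rebuild the sentence
paragraph = ''.join(fragments)

print(fragments)
print(paragraph)
```

Here `fragments` holds five pieces in document order, and `paragraph` is the full sentence with the formatting tags removed.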

Cesar Hernández Ramírez

student•

Interesting, Gabriel, I'll use your solution! I had exactly that question.

Gabriel Salvador

student•

Thanks for the help. I've already solved it. In case it helps anyone: I wanted to extract news articles from elcomercio.com, where the article text sometimes has words in bold using the <strong> tag. If I apply what is shown later in the course, the program returns a list of sentence fragments, split wherever a <strong> tag appears.

What I did was remove the text() function from the XPath expression, leaving:

XPATH_BODY = '//div[@class="entry__content"]/p'

Then, since the body of the article is:

body =  parsed.xpath(XPATH_BODY)

To get the text without the bold markup, I do:

text_body = ''
for i in body:
    text_body = text_body + i.text_content()
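To illustrate the difference, here is a small sketch with an invented fragment (the markup in `SAMPLE` is an assumption, not real elcomercio.com HTML). Selecting `/p/text()` returns only the text nodes that are direct children of each paragraph, so the bold word is dropped and the sentence is split, while `text_content()` returns the whole paragraph:

```python
import lxml.html as html

# Hypothetical article body with one bold word
SAMPLE = (
    '<div class="entry__content">'
    '<p>Prices rose <strong>sharply</strong> this week.</p>'
    '<p>Officials declined to comment.</p>'
    '</div>'
)

parsed = html.fromstring(SAMPLE)

# With text(): only direct text nodes, so the sentence is split at <strong>
split_text = parsed.xpath('//div[@class="entry__content"]/p/text()')

# Without text(): select the <p> elements and ask each one for all its text
paragraphs = parsed.xpath('//div[@class="entry__content"]/p')
text_body = '\n'.join(p.text_content() for p in paragraphs)

print(split_text)
print(text_body)
```

Joining with a newline is a choice made for this sketch; it keeps the paragraphs from running into each other.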

And the problem is solved. The complete code is here:

import requests
import lxml.html as html  # to apply XPath to HTML
import os
import datetime
import nltk


HOME_URL = 'https://www.elcomercio.com/'


XPATH_LINK_TO_ARTICLE = '//h3[@class="article-highlighted__title"]/a/@href'  # links to each news article
XPATH_TITLE = '//h1[@class="entry__title"]/text()'
XPATH_BODY = '//div[@class="entry__content"]/p'

a = {}

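def riqueza_lexica(texto):
    # Lexical richness: distinct items divided by total items
    if not texto:
        return 0.0  # avoid dividing by zero when nothing was scraped
    vocabulario = set(texto)
    return len(vocabulario) / len(texto)
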
def parse_notice(link, today):
    try: 
        response =  requests.get(link)
        if response.status_code == 200:
            notice = response.content.decode('utf-8')#brings the html code from the website
            parsed = html.fromstring(notice)

            try:
                title = parsed.xpath(XPATH_TITLE)[0]  # extract the title
                title = title.replace('\"', '')  # delete the character "
                title = title.replace('\'', '')  # delete the character '
                body = parsed.xpath(XPATH_BODY)

                #for i in body:
                #    print(i.text_content())

            except IndexError:
                return
            # Join paragraphs with a newline so they do not run together
            text_body = '\n'.join(p.text_content() for p in body)

            a[title] = text_body
            
            '''
            with open(f'{today}/{title}.txt', 'w', encoding='utf-8') as f:
                f.write(title)
                f.write('\n\n')
                for p in body:
                    f.write(p.text_content())
                    f.write('\n')
            '''
                    
        else:
            raise ValueError(f'Error: {response.status_code}')
    except ValueError as ve:
        print(ve)
        


def parse_home():
    try:
        response = requests.get(HOME_URL)
        
        if response.status_code == 200:# Status code 200 means that everything is ok
            home = response.content.decode('utf-8')
            parsed = html.fromstring(home)
    
            links_to_notices = parsed.xpath(XPATH_LINK_TO_ARTICLE)
            #print(links_to_notices)

            today = datetime.date.today().strftime('%d-%m-%Y')
            # Directory creation is disabled because results are kept in memory
            # if not os.path.isdir(today):
            #     os.mkdir(today)  # make a dir with the name of the day
            for link in links_to_notices:
                parse_notice(link, today)



            
        else:
            raise ValueError(f"Error: {response.status_code}")


    except ValueError as ve: 
        print(ve)

def main():
    parse_home()
    # Join every article body into a single string
    text_to_analize = ' '.join(a.values())
    print(text_to_analize)
    # Split into words so the metrics count words rather than characters
    words = text_to_analize.split()
    print(riqueza_lexica(words))
    fdist = nltk.FreqDist(words)
    print(fdist.most_common(20))



if __name__ == '__main__':
    main()

Héctor Eduardo López Carballo

student•

Hi!

What problem are you having? Could you share more information? From what I understood of XPath, you could use

//text()

and that should return any text inside the container you are in.
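One caveat worth checking: in lxml, an expression that starts with `//` is evaluated from the document root even when it is called on a sub-element, so a leading dot is needed to stay inside the container. A small sketch with invented markup (the `SAMPLE` page is an assumption made for illustration):

```python
import lxml.html as html

# Hypothetical page: a heading outside the container, a paragraph inside it
SAMPLE = (
    '<html><body>'
    '<h1>Site title</h1>'
    '<div class="entry__content"><p>Some <strong>bold</strong> text.</p></div>'
    '</body></html>'
)

parsed = html.fromstring(SAMPLE)
container = parsed.xpath('//div[@class="entry__content"]')[0]

# '//text()' searches the whole document, not just the container
everything = container.xpath('//text()')

# './/text()' is relative to the container
inside = container.xpath('.//text()')

print(everything)
print(inside)
```

The first list includes the heading text from outside the container; the second contains only the text nodes under the selected `div`.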

Curso de Web Scraping con Python y Xpath